Notes for DMDWH -Module1
Module-1
Introduction, Why Data Mining?, What is Data Mining, Definition, KDD, Challenges, Data Mining
Tasks, Data Preprocessing, Data Cleaning, Missing Data, Dimensionality Reduction, Feature
Subset Selection, Discretization and Binarization, Data Transformation, Measures of Similarity
and Dissimilarity - Basics
There is a huge amount of data available in the Information Industry. This data is of no
use until it is converted into useful information. It is necessary to analyse this huge
amount of data and extract useful information from it.
Extraction of information is not the only process we need to perform; data mining also
involves other processes such as Data Cleaning, Data Integration, Data Transformation,
Data Mining, Pattern Evaluation and Data Presentation. Once all these processes are
over,
we would be able to use this information in many applications such as Fraud Detection,
Market Analysis, Production Control, Science Exploration, etc.
Need for Data Mining
Growth of OLTP data: The first database systems were implemented in the 1960s and
1970s. Many enterprises therefore have more than 30 years of experience in using
database systems and have accumulated large amounts of data during that time.
Growth of data due to cards: The growing use of credit cards and loyalty cards is an
important source of data growth. In the USA, there has been tremendous growth in the use of
loyalty cards. Even in Australia, the use of cards like FlyBuys has grown considerably.
Growth in data due to the web: E-commerce developments have resulted in information
about visitors to Web sites being captured, once again resulting in mountains of data for
some companies.
Growth in data due to other sources: There are many other sources of data.
Some of them are:
Telephone Transactions
Frequent flyer transactions
Medical transactions
Immigration and customs transactions
Banking transactions
Motor vehicle transactions
Utilities (e.g., electricity and gas) transactions
Shopping transactions
Growth in data storage capacity: Another way of illustrating data growth is to
consider
annual disk storage sales over the last few years.
Decline in the cost of processing: The cost of computing hardware has declined rapidly over
the last 30 years, coupled with an increase in hardware performance. Not only do the
prices for processors continue to decline, but the prices for computer peripherals
have also been declining.
What is Data Mining?
Data Mining is defined as extracting information from huge sets of data. In other words,
we can say that data mining is the procedure of mining knowledge from data. The
information or knowledge extracted in this way can be used for any of the following applications:
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
Apart from these, data mining is also used in areas such as sports, astrology, and Internet
Web Surf-Aid.
Listed below are the various fields of market where data mining is used −
Customer Profiling− Data mining helps determine what kind of people buy what
kind of products.
Identifying Customer Requirements− Data mining helps in identifying the best
products for different customers. It uses prediction to find the factors that may attract new
customers.
Cross Market Analysis− Data mining performs Association/correlations between
product sales.
Target Marketing− Data mining helps to find clusters of model customers who
share the same characteristics such as interests, spending habits, income, etc.
Determining Customer purchasing pattern− Data mining helps in determining
customer purchasing pattern.
Finance Planning and Asset Evaluation− It involves cash flow analysis and prediction,
and the analysis of spending to evaluate assets.
Fraud Detection
Data mining is also used in the fields of credit card services and telecommunication to
detect frauds. In fraud telephone calls, it helps to find the destination of the call, duration
of the call, time of the day or week, etc. It also analyzes the patterns that deviate from
expected norms.
Knowledge Discovery in Databases (KDD)
What is Knowledge Discovery?
Some people don’t differentiate data mining from knowledge discovery while others view
data mining as an essential step in the process of knowledge discovery. Here is the list of
steps involved in the knowledge discovery process −
Data Cleaning− In this step, noise and inconsistent data are removed.
Data Integration− In this step, data from multiple sources are combined.
Data Selection− In this step, data relevant to the analysis task are retrieved from
the database.
Data Transformation− In this step, data are transformed and consolidated into forms
appropriate for mining.
Data Mining− In this step, intelligent methods are applied in order to extract data
patterns.
Pattern Evaluation− In this step, the discovered patterns are evaluated.
Knowledge Presentation− In this step, the extracted knowledge is presented to the user.
Challenges in Data Mining
Mining methodology and user-interaction issues include the following −
Interactive mining of knowledge at multiple levels of abstraction− The data mining
process needs to be interactive so that users can focus the search for
patterns, providing and refining data mining requests based on the returned results.
Data mining query languages and ad hoc data mining− A data mining query
language that allows users to describe ad hoc mining tasks should be integrated with a
data warehouse query language and optimized for efficient and flexible data mining.
Presentation and visualization of data mining results− Once the patterns are
discovered, they need to be expressed in high-level languages and visual representations
that are easily understandable.
Handling noisy or incomplete data− Data cleaning methods are required to
handle noise and incomplete objects while mining data regularities. If such data
cleaning methods are not available, the accuracy of the discovered patterns will be poor.
Performance Issues
There can be performance-related issues such as follows −
Parallel, distributed, and incremental mining algorithms− Factors such as the huge
size of databases, the wide distribution of data, and the complexity of data mining methods
motivate the development of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions, which are processed in parallel; the results
from the partitions are then merged. Incremental algorithms incorporate database updates
without mining the entire data again from scratch.
Data Mining Tasks
Data mining tasks can be divided into two categories: descriptive and classification/prediction.
Descriptive Function
The descriptive function deals with the general properties of data in the database. Here is
the list of descriptive functions −
Class/Concept Description
Mining of Frequent Patterns
Mining of Associations
Mining of Correlations
Mining of Clusters
Class/Concept Description
Class/Concept refers to the data to be associated with classes or concepts. For
example, in a company, the classes of items for sale include computers and printers, and
concepts of customers include big spenders and budget spenders. Such descriptions of a
class or a concept are called class/concept descriptions. These descriptions can be derived
in the following two ways:
Data Characterization− This refers to summarizing the data of the class under study. The
class under study is called the target class.
Data Discrimination− It refers to the comparison of the target class with
some predefined contrasting group or class.
Mining of Frequent Patterns
Frequent patterns are those patterns that occur frequently in transactional data. Kinds of
frequent patterns include the following −
Frequent Item Set− It refers to a set of items that frequently appear together, for example, milk
and bread.
Mining of Association
Associations are used in retail sales to identify items that are frequently purchased
together. Association mining refers to the process of uncovering relationships among data
and determining association rules.
For example, a retailer generates an association rule that shows that 70% of time milk is
sold with bread and only 30% of times biscuits are sold with bread.
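To make the rule above concrete, the following sketch (plain Python, using a small invented
transaction list) shows how the support and confidence of a rule such as bread → milk are
computed; the item names and numbers are illustrative only.

```python
# Support and confidence of an association rule, computed over a toy transaction list.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "biscuits"},
    {"bread", "milk"},
    {"milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent U consequent) / support(antecedent)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(confidence({"bread"}, {"milk"}, transactions))      # 0.75: milk is sold with bread 75% of the time
print(confidence({"bread"}, {"biscuits"}, transactions))  # 0.25
```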
Mining of Correlations
It is a kind of additional analysis performed to uncover interesting statistical correlations
between associated attribute-value pairs or between two item sets, to analyze whether they
have a positive, negative, or no effect on each other.
Mining of Clusters
A cluster refers to a group of similar kinds of objects. Cluster analysis refers to forming
groups of objects that are very similar to each other but are highly different from the
objects in other clusters.
Classification and Prediction Function
Classification and prediction derive models that can be used to predict class labels or numeric
values. The derived model can be represented in forms such as classification (IF-THEN) rules,
decision trees, mathematical formulae, or neural networks.
The list of functions involved in these processes is as follows −
Classification− It predicts the class of objects whose class label is unknown. Its
objective is to find a derived model that describes and distinguishes data classes or
concepts. The derived model is based on the analysis of a set of training data, i.e., data
objects whose class label is known.
Prediction− It is used to predict missing or unavailable numerical data values rather than
class labels; it can also be used for identification of distribution trends based on available data.
Outlier Analysis− Outliers may be defined as data objects that do not comply with the
general behaviour or model of the data.
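As an illustration of the classification function, here is a hedged sketch using scikit-learn's
DecisionTreeClassifier (assuming scikit-learn is available); the tiny training set of age/income
values and "big"/"budget" spender labels is invented purely for illustration.

```python
# Derive a classification model from labeled training data, then predict an unseen object.
from sklearn.tree import DecisionTreeClassifier

X_train = [[25, 20000], [45, 90000], [35, 60000], [22, 15000], [50, 120000]]  # [age, income]
y_train = ["budget", "big", "big", "budget", "big"]                           # class labels

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)              # the derived model is based on the training data

print(model.predict([[30, 70000]]))      # predict the class of an object whose label is unknown
```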
Data Preprocessing
Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format. Real-world data is often incomplete, inconsistent, and/or lacking
in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is
a proven method of resolving such issues. Data preprocessing prepares raw data for
further processing.
Data preprocessing is used in database-driven applications such as customer relationship
management and in rule-based applications (such as neural networks).
Data goes through a series of steps during preprocessing:
Data Cleaning: Data is cleansed by filling in missing values, smoothing the noisy data, or
resolving the inconsistencies in the data.
Data Integration: Data with different representations are put together and conflicts
within the data are resolved.
Data Transformation: Data is normalized, aggregated, and generalized into forms
appropriate for mining.
Data Reduction: This step aims to present a reduced representation of the data in
a data warehouse.
Data Cleaning
The quality of your data is critical to the final analysis. Data that is incomplete, noisy, or
inconsistent can affect your results.
Data cleaning in data mining is the process of detecting and removing corrupt or
inaccurate records from a record set, table or database.
Some data cleaning methods
Missing data
1. You can ignore the tuple. This is usually done when the class label is missing. This method is
not very effective unless the tuple contains several attributes with missing values.
2. You can fill in the missing value manually. This approach is effective on a small data set
with some missing values.
3. You can replace all missing attribute values with a global constant, such as a label like
“Unknown” or minus infinity.
4. You can use the attribute mean to fill in the missing value. For example, if the average
customer income is $25,000, you can use this value to replace a missing value for income.
5. You can use the most probable value to fill in the missing value.
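A minimal sketch of methods 1, 3, and 4 above, assuming pandas is available; the small
customer table is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["A", "B", "C", "D"],
    "income":   [30000, None, 22000, None],
    "city":     ["Pune", None, "Delhi", "Mumbai"],
})

dropped      = df.dropna()                                  # 1. ignore (drop) tuples with missing values
filled_const = df.fillna({"city": "Unknown"})               # 3. replace with a global constant
filled_mean  = df.fillna({"income": df["income"].mean()})   # 4. fill with the attribute mean

print(filled_mean)
```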
Noisy Data
Noise is a random error or variance in a measured variable. Noisy data may be due to
faulty data collection instruments, data entry problems, and technology limitations.
How to Handle Noisy Data?
Binning
Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values
around it. The sorted values are distributed into a number of “buckets,” or bins.
For example
Price = 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins
Bin a: 4, 8, 15
Bin b: 21, 21, 24
Bin c: 25, 28, 34
In this example, the data for price are first sorted and then partitioned into equal-frequency
bins of size 3.
Smoothing by bin means
Bin a: 9, 9, 9
Bin b: 22, 22, 22
Bin c: 29, 29, 29
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
Smoothing by bin boundaries
Bin a: 4, 4, 15
Bin b: 21, 21, 24
Bin c: 25, 25, 34
In smoothing by bin boundaries, each bin value is replaced by the closest boundary value.
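The example above can be reproduced with the following sketch (plain Python, no external
libraries); the bin size of 3 matches the partitioning shown.

```python
# Equal-frequency binning with smoothing by bin means and by bin boundaries.
def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of equal size."""
    data = sorted(values)
    return [data[i:i + bin_size] for i in range(0, len(data), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closest bin boundary (the bin's min or max)."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if (v - lo) <= (hi - v) else hi for v in b])
    return smoothed

price = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(price, 3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```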
Regression
Data can be smoothed by fitting the data to a regression function.
Clustering
Outliers may be detected by clustering, where similar values are organized into groups, or
“clusters.” Values that fall outside of the set of clusters may be considered outliers.
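A small sketch of smoothing by regression, assuming NumPy is available: a straight line is
fitted to invented noisy (x, y) values, and the y-values are replaced by the fitted values.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.3, 11.9])   # roughly y = 2x plus noise

slope, intercept = np.polyfit(x, y, deg=1)       # fit a linear regression function
y_smoothed = slope * x + intercept               # replace noisy values with fitted values

print(np.round(y_smoothed, 2))
```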
Dimensionality Reduction
Dimensionality reduction is the process of reducing the number of random variables or
attributes under consideration. Dimensionality reduction methods include wavelet
transforms and principal components analysis which transform or project the original data
onto a smaller space. Attribute subset selection is a method of dimensionality reduction in
which irrelevant, weakly relevant, or redundant attributes or dimensions are detected and
removed.
Wavelet Transforms
The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied
to a data vector X, transforms it to a numerically different vector, X′, of wavelet coefficients. The
two vectors are of the same length. When applying this technique to data reduction, we consider
each tuple as an n-dimensional data vector, that is, X = (x1,x2,...,xn), depicting n measurements
made on the tuple from n database attributes.
“How can this technique be useful for data reduction if the wavelet transformed data are of the
same length as the original data?”
The usefulness lies in the fact that the wavelet transformed data can be truncated. A compressed
approximation of the data can be retained by storing only a small fraction of the strongest of the
wavelet coefficients. For example, all wavelet coefficients larger than some user-specified
threshold can be retained. All other coefficients are set to 0. The resulting data representation is
therefore very sparse, so that operations that can take advantage of data sparsity are
computationally very fast if performed in wavelet space. The technique also works to remove noise
without smoothing out the main features of the data, making it effective for data cleaning as well.
Given a set of coefficients, an approximation of the original data can be constructed by applying
the inverse of the DWT used.
The DWT is closely related to the discrete Fourier transform (DFT), a signal processing technique
involving sines and cosines. In general, however, the DWT achieves better lossy compression. That
is, if the same number of coefficients is retained for a DWT and a DFT of a given data vector, the
DWT version will provide a more accurate approximation of the original data. Hence, for an
equivalent approximation, the DWT requires less space than the DFT. Unlike the DFT, wavelets
are quite localized in space, contributing to the conservation of local detail. There is only one DFT,
yet there are several families of DWTs. Figure 1.5 shows some wavelet families. Popular wavelet
transforms include the Haar-2, Daubechies-4, and Daubechies-6. The general procedure for
applying a discrete wavelet transform uses a hierarchical pyramid algorithm that halves the data at
each iteration, resulting in fast computational speed. The method is as follows:
1. The length, L, of the input data vector must be an integer power of 2. This condition can be
met by padding the data vector with zeros as necessary (L ≥ n).
2. Each transform involves applying two functions. The first applies some data smoothing, such
as a sum or weighted average. The second performs a weighted difference, which acts to bring
out the detailed features of the data.
3. The two functions are applied to pairs of data points in X, that is, to all pairs of measurements
(x2i, x2i+1). This results in two data sets of length L/2. In general, these represent a smoothed
or low-frequency version of the input data and the high frequency content of it, respectively.
4. The two functions are recursively applied to the data sets obtained in the previous loop, until
the resulting data sets obtained are of length 2.
5. Selected values from the data sets obtained in the previous iterations are designated the
wavelet coefficients of the transformed data.
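The pyramid procedure above can be sketched as follows for a Haar-style transform (assuming
NumPy is available); the pairwise weighted sums and differences play the roles of the smoothing
and detail functions in steps 2–3. This is an illustrative sketch, not a production DWT
implementation.

```python
import numpy as np

def haar_pyramid(x):
    """Hierarchical pyramid: repeatedly split the data into a smoothed (low-frequency)
    half and a detail (high-frequency) half, stopping when the smoothed set has length 2."""
    x = np.asarray(x, dtype=float)
    assert (len(x) & (len(x) - 1)) == 0, "length must be a power of 2 (pad with zeros if not)"
    details = []
    while len(x) > 2:
        smooth = (x[0::2] + x[1::2]) / np.sqrt(2)   # data smoothing (weighted sum of pairs)
        detail = (x[0::2] - x[1::2]) / np.sqrt(2)   # weighted difference (detail of pairs)
        details.append(detail)
        x = smooth                                  # recurse on the smoothed data set
    return x, details

smooth, details = haar_pyramid([2, 2, 0, 2, 3, 5, 4, 4])
print(smooth, [d.round(2) for d in details])
```

Keeping only the strongest detail coefficients and setting the rest to 0 gives the truncated,
compressed representation described above.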
Fig. 1.5: Examples of wavelet families (e.g., Haar-2, Daubechies-4, Daubechies-6).
Equivalently, a matrix multiplication can be applied to the input data in order to obtain the wavelet
coefficients, where the matrix used depends on the given DWT. The matrix must be orthonormal,
meaning that the columns are unit vectors and are mutually orthogonal, so that the matrix inverse
is just its transpose. This property allows the reconstruction of the data from the smooth and
smooth-difference data sets. Wavelet transforms can be applied to multidimensional data such as a
data cube. This is done by first applying the transform to the first dimension, then to the second,
and so on. The computational complexity involved is linear with respect to the number of cells in
the cube. Wavelet transforms give good results on sparse or skewed data and on data with ordered
attributes. Lossy compression by wavelets is reportedly better than JPEG compression, the current
commercial standard. Wavelet transforms have many real-world applications, including the
compression of fingerprint images, computer vision, analysis of time-series data, and data cleaning.
Principal Components Analysis (PCA)
Principal components analysis (PCA) searches for k orthonormal n-dimensional vectors that can
best be used to represent the data, where k ≤ n. The original data are thus projected onto a much
smaller space, resulting in dimensionality reduction. The basic procedure is as follows:
1. The input data are normalized, so that attributes with large domains do not dominate
attributes with smaller domains.
2. PCA computes k orthonormal vectors that provide a basis for the normalized input data.
These unit vectors are the principal components; the input data are a linear combination of them.
3. The principal components are sorted in order of decreasing “significance” or strength. The
principal components essentially serve as a new set of axes for the data, providing important
information about variance. That is, the sorted axes are such that the first axis shows the most
variance among the data, the second axis shows the next highest variance, and so on. For example,
Figure 1.6 shows the first two principal components, Y1 and Y2, for the given set of data originally
mapped to the axes X1 and X2. This information helps identify groups or patterns within the data.
4. Because the components are sorted in decreasing order of “significance,” the data size can be
reduced by eliminating the weaker components, that is, those with low variance. Using the
strongest principal components, it should be possible to reconstruct a good approximation of the
original data. PCA can be applied to ordered and unordered attributes, and can handle sparse
data and skewed data. Multidimensional data of more than two dimensions can be handled by
reducing the problem to two dimensions. Principal components may be used as inputs to multiple
regression and cluster analysis. In comparison with wavelet transforms, PCA tends to be better at
handling sparse data, whereas wavelet transforms are more suitable for data of high
dimensionality.
Fig. 1.6: The first two principal components, Y1 and Y2, for a data set originally mapped to the axes X1 and X2.
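A hedged sketch of PCA-based reduction using scikit-learn (assuming it is available); the
4-attribute data matrix is invented, and only the two strongest components are kept.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.array([[2.5, 2.4, 0.5, 1.0],
              [0.5, 0.7, 2.1, 0.3],
              [2.2, 2.9, 0.4, 1.1],
              [1.9, 2.2, 0.6, 0.9],
              [3.1, 3.0, 0.3, 1.4]])

X_norm = StandardScaler().fit_transform(X)   # step 1: normalize the input data
pca = PCA(n_components=2)                    # keep the 2 strongest principal components
X_reduced = pca.fit_transform(X_norm)        # project the data onto the new axes Y1, Y2

print(pca.explained_variance_ratio_)         # variance captured by each component
print(X_reduced.shape)                       # (5, 2): the reduced representation
```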
Attribute Subset Selection
Data sets for analysis may contain hundreds of attributes, many of which
may be irrelevant to the mining task or redundant. For example, if the task is to classify customers
based on whether or not they are likely to purchase a popular new CD at All Electronics when
notified of a sale, attributes such as the customer’s telephone number are likely to be irrelevant,
unlike attributes such as age or music taste. Although it may be possible for a domain expert to
pick out some of the useful attributes, this can be a difficult and time consuming task, especially
when the data’s behavior is not well known. (Hence, a reason behind its analysis!) Leaving out
relevant attributes or keeping irrelevant attributes may be detrimental, causing confusion for the
mining algorithm employed. This can result in discovered patterns of poor quality. In addition, the
added volume of irrelevant or redundant attributes can slow down the mining process.
Feature Subset Selection
Attribute subset selection or Feature Subset selection reduces the data set size by removing
irrelevant or redundant attributes (or dimensions). The goal of attribute subset selection is to find a
minimum set of attributes such that the resulting probability distribution of the data classes is as
close as possible to the original distribution obtained using all attributes. Mining on a reduced set
of attributes has an additional benefit: It reduces the number of attributes appearing in the
discovered patterns, helping to make the patterns easier to understand. “How can we find a ‘good’
subset of the original attributes?” For n attributes, there are 2^n possible subsets. An exhaustive
search for the optimal subset of attributes can be prohibitively expensive, especially as n and the
number of data classes increase. Therefore, heuristic methods that explore a reduced search space
are commonly used for attribute subset selection. These methods are typically greedy in that, while
searching through attribute space, they always make what looks to be the best choice at the time.
Their strategy is to make a locally optimal choice in the hope that this will lead to a globally optimal
solution. Such greedy methods are effective in practice and may come close to estimating an
optimal solution. The “best” (and “worst”) attributes are typically determined using tests of
statistical significance, which assume that the attributes are independent of one another. Many
other attribute evaluation measures can be used such as the information gain measure used in
building decision trees for classification.
Basic heuristic methods of attribute subset selection include the techniques that follow, some of
which are illustrated in Figure 1.7.
1. Stepwise forward selection: The procedure starts with an empty set of attributes as the
reduced set. The best of the original attributes is determined and added to the reduced set. At each
subsequent iteration or step, the best of the remaining original attributes is added to the set.
2. Stepwise backward elimination: The procedure starts with the full set of attributes. At each
step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The stepwise forward
selection and backward elimination methods can be combined so that, at each step, the procedure
selects the best attribute and removes the worst from among the remaining attributes.
4. Decision tree induction: Decision tree algorithms (e.g., ID3, C4.5, and CART) were originally
intended for classification. Decision tree induction constructs a flowchart like structure where each
internal (nonleaf) node denotes a test on an attribute, each branch corresponds to an outcome of the
test, and each external (leaf) node denotes a class prediction. At each node, the algorithm chooses
the “best” attribute to partition the data into individual classes. When decision tree induction is
used for attribute subset selection, a tree is constructed from the given data. All attributes that do
not appear in the tree are assumed to be irrelevant. The set of attributes appearing in the tree form
the reduced subset of attributes. The stopping criteria for the methods may vary. The procedure
may employ a threshold on the measure used to determine when to stop the attribute selection
process. In some cases, we may want to create new attributes based on others. Such attribute
construction can help improve accuracy and understanding of structure in high-dimensional data.
For example, we may wish to add the attribute area based on the attributes height and width. By
combining attributes, attribute construction can discover missing information about the
relationships between data attributes that can be useful for knowledge discovery.
Fig. 1.7: Greedy (heuristic) methods for attribute subset selection.
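A hedged sketch of stepwise forward selection (method 1 above): at each step the attribute that
most improves a cross-validated score is added to the reduced set. It assumes scikit-learn is
available and uses a synthetic data set and a decision-tree scorer purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           n_redundant=2, random_state=0)

def forward_select(X, y, k):
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        # score every candidate attribute when added to the current reduced set
        scores = {a: cross_val_score(DecisionTreeClassifier(random_state=0),
                                     X[:, selected + [a]], y, cv=5).mean()
                  for a in remaining}
        best = max(scores, key=scores.get)   # greedy, locally optimal choice
        selected.append(best)
        remaining.remove(best)
    return selected

print(forward_select(X, y, k=3))             # indices of the chosen attributes
```

Backward elimination works the same way in reverse, starting from the full attribute set and
greedily removing the attribute whose removal hurts the score least.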
Discretization by Binning
Binning is a top-down splitting technique based on a specified number of bins. Binning methods
are also used as discretization methods for data reduction and concept hierarchy generation. For
example, attribute values can be discretized by applying equal-width or equal-frequency binning,
and then replacing each bin value by the bin mean or median, as in smoothing by bin means or
smoothing by bin medians, respectively. These techniques can be applied recursively to the
resulting partitions to generate concept hierarchies. Binning does not use class information and is
therefore an unsupervised discretization technique. It is sensitive to the user-specified number of
bins, as well as the presence of outliers.
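A small sketch of unsupervised discretization by binning, assuming pandas is available:
equal-width bins via pd.cut and equal-frequency bins via pd.qcut, with each value replaced by a
conceptual label. The age values and labels are illustrative only.

```python
import pandas as pd

age = pd.Series([13, 15, 16, 19, 20, 21, 25, 30, 35, 40, 45, 52, 70])

equal_width = pd.cut(age, bins=3, labels=["youth", "adult", "senior"])   # equal-width binning
equal_freq  = pd.qcut(age, q=3, labels=["low", "mid", "high"])           # equal-frequency binning

print(pd.DataFrame({"age": age, "equal_width": equal_width, "equal_freq": equal_freq}))
```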
Discretization by Decision Tree Analysis
Techniques for generating decision trees for classification can also be applied to discretization;
because they make use of class label information, they are supervised. To discretize a numeric
attribute, A, the method selects the value of A that has the minimum
entropy as a split-point, and recursively partitions the resulting intervals to arrive at a hierarchical
discretization. Such discretization forms a concept hierarchy for A. Because decision tree–based
discretization uses class information, it is more likely that the interval boundaries (split-points)
are defined to occur in places that may help improve classification accuracy. Measures of
correlation can be used for discretization.
Data Transformation
In data transformation, the data are transformed or consolidated into forms appropriate
for mining. Strategies for data transformation include the following:
1. Smoothing, which works to remove noise from the data. Techniques include binning,
regression, and clustering.
2. Attribute construction (or feature construction), where new attributes are constructed
and added from the given set of attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual
total amounts. This step is typically used in constructing a data cube for data analysis at
multiple abstraction levels.
4. Normalization, where the attribute data are scaled so as to fall within a smaller range,
such as −1.0 to 1.0, or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by
interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior).
The labels, in turn, can be recursively organized into higher-level concepts, resulting in
a concept hierarchy for the numeric attribute.
6. Concept hierarchy generation for nominal data, where attributes such as street can be
generalized to higher-level concepts, like city or country. Many hierarchies for nominal
attributes are implicit within the database schema and can be automatically defined at the
schema definition level.
Recall that there is much overlap between the major data preprocessing tasks. Smoothing is
a form of data cleaning; the data cleaning process also makes use of ETL tools, where users
specify transformations to correct data inconsistencies.
Discretization techniques can be categorized based on how the
discretization is performed, such as whether it uses class information or which direction it
proceeds (i.e., top-down vs. bottom-up). If the discretization process uses class
information, then we say it is supervised discretization. Otherwise, it is unsupervised. If
the process starts by first finding one or a few points (called split points or cut points) to
split the entire attribute range, and then repeats this recursively on the resulting intervals,
it is called top-down discretization or splitting. This contrasts with bottom-up
discretization or merging, which starts by considering all of the continuous values as
potential split-points, removes some by merging neighbourhood values to form intervals,
and then recursively applies this process to the resulting intervals. Data discretization and
concept hierarchy generation are also forms of data reduction. The raw data are replaced
by a smaller number of interval or concept labels. This simplifies the original data and
makes the mining more efficient. The resulting patterns mined are typically easier to
understand. Concept hierarchies are also useful for mining at multiple abstraction levels.
Data Normalization
Min-max normalization performs a linear transformation on the original data. Suppose that
$min_A$ and $max_A$ are the minimum and maximum values of an attribute, A. Min-max
normalization maps a value, $v_i$, of A to $v_i'$ in the range $[new\_min_A, new\_max_A]$ by computing
$v_i' = \frac{v_i - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$
Min-max normalization preserves the relationships among the original data values. It will
encounter an “out-of-bounds” error if a future input case for normalization falls outside of the
original data range for A.
Example (min-max normalization): Suppose that the minimum and maximum values for
the attribute income are $12,000 and $98,000, respectively. We would like to map income to the
range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is transformed to
(73,600 − 12,000)/(98,000 − 12,000) × (1.0 − 0.0) + 0.0 = 0.716.
In z-score normalization (or zero-mean normalization), the values for an attribute, A, are
normalized based on the mean (i.e., average) and standard deviation of A. A value, $v_i$, of
A is normalized to $v_i'$ by computing
$v_i' = \frac{v_i - \bar{A}}{\sigma_A}$
where $\bar{A}$ and $\sigma_A$ are the mean and standard deviation, respectively, of attribute A. The
mean is $\bar{A} = \frac{1}{n}(v_1 + v_2 + \cdots + v_n)$, and $\sigma_A$ is
computed as the square root of the variance of A. This method of normalization is useful when
the actual minimum and maximum of attribute A are unknown, or when there are outliers that
dominate the min-max normalization.
Example (z-score normalization): Suppose that the mean and standard deviation of the values
for the attribute income are $54,000 and $16,000, respectively. With z-score normalization, a
value of $73,600 for income is transformed to (73,600 − 54,000)/16,000 = 1.225.
A variation of this z-score normalization replaces
the standard deviation by the mean absolute deviation of A. The mean absolute deviation of
A, denoted $s_A$, is
$s_A = \frac{1}{n}\left(|v_1 - \bar{A}| + |v_2 - \bar{A}| + \cdots + |v_n - \bar{A}|\right)$
The mean absolute deviation, $s_A$, is more robust to outliers than the standard deviation,
$\sigma_A$. When computing the mean absolute deviation, the deviations from the mean (i.e.,
$|v_i - \bar{A}|$) are not squared; hence, the effect of outliers is somewhat reduced.
Normalization by decimal scaling normalizes by moving the decimal point of values of attribute A.
The number of decimal points moved depends on the maximum absolute value of A. A value, $v_i$,
of A is normalized to $v_i'$ by computing
$v_i' = \frac{v_i}{10^j}$
where j is the smallest integer such that $\max(|v_i'|) < 1$.
Decimal scaling
Suppose that the recorded values of A range from −986 to 917. The maximum absolute
value of A is 986. To normalize by decimal scaling, we therefore divide each value by
1000 (i.e., j = 3) so that −986 normalizes to −0.986 and 917 normalizes to 0.917. Note
that normalization can change the original data quite a bit, especially when using z-score
normalization or decimal scaling. It is also necessary to save the normalization
parameters (e.g., the mean and standard deviation if using z-score normalization) so that
future data can be normalized in a uniform manner.
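The three normalization methods can be sketched as follows (plain Python), reproducing the
income and decimal-scaling examples above.

```python
import math

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization to the range [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Z-score (zero-mean) normalization."""
    return (v - mean_a) / std_a

def decimal_scaling(v, max_abs):
    """Divide by 10^j, where j is the smallest integer making max(|v'|) < 1."""
    j = math.floor(math.log10(max_abs)) + 1
    return v / (10 ** j)

print(round(min_max(73600, 12000, 98000), 3))                 # 0.716
print(round(z_score(73600, 54000, 16000), 3))                 # 1.225
print(decimal_scaling(917, 986), decimal_scaling(-986, 986))  # 0.917 -0.986
```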
Measures of Similarity and Dissimilarity
Similarity Measure: A numerical measure of how alike two data objects are. It is often in the
range [0, 1] and is higher when the objects are more alike.
Dissimilarity Measure: A numerical measure of how different two data objects are. It ranges
from 0 (objects are alike) upward, possibly to ∞ (objects are completely different).
Proximity refers to either a similarity or a dissimilarity.
A distance measure d satisfies the triangle inequality: d(p, r) ≤ d(p, q) + d(q, r) for all points
p, q, and r, where d(p, q) is the distance (dissimilarity) between p and q.
Euclidean Distance
Assume that we have measurements $x_{ik}$, $i = 1, \ldots, N$, on variables $k = 1, \ldots, p$ (also called
attributes). The Euclidean distance between the ith and jth objects is
$d_E(i,j) = \left( \sum_{k=1}^{p} (x_{ik} - x_{jk})^2 \right)^{1/2}$
for every pair (i, j) of observations.
The weighted Euclidean distance is
$d_{WE}(i,j) = \left( \sum_{k=1}^{p} W_k (x_{ik} - x_{jk})^2 \right)^{1/2}$
If scales of the attributes differ substantially, standardization is necessary.
Minkowski Distance
The Minkowski distance is a generalization of the Euclidean distance.
With the measurements $x_{ik}$, $i = 1, \ldots, N$, $k = 1, \ldots, p$, the Minkowski distance is
$d_M(i,j) = \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^{\lambda} \right)^{1/\lambda}$
where $\lambda \ge 1$. It is also called the $L_\lambda$ metric.
Mahalanobis Distance
Let X be an $N \times p$ matrix. Then the ith row of X is
$x_i^T = (x_{i1}, \ldots, x_{ip})$
The Mahalanobis distance is
$d_{MH}(i,j) = \left( (x_i - x_j)^T \Sigma^{-1} (x_i - x_j) \right)^{1/2}$
where $\Sigma$ is the $p \times p$ sample covariance matrix.
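A hedged sketch (assuming NumPy is available) of the Euclidean, Minkowski, and Mahalanobis
distances defined above; the small data matrix is invented for illustration.

```python
import numpy as np

X = np.array([[2.0, 4.0, 1.0],
              [3.0, 5.0, 0.5],
              [6.0, 1.0, 2.0],
              [5.0, 2.0, 2.5]])
xi, xj = X[0], X[1]

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def minkowski(a, b, lam):
    return np.sum(np.abs(a - b) ** lam) ** (1.0 / lam)

def mahalanobis(a, b, data):
    cov = np.cov(data, rowvar=False)              # p x p sample covariance matrix
    diff = a - b
    return np.sqrt(diff @ np.linalg.inv(cov) @ diff)

print(euclidean(xi, xj))
print(minkowski(xi, xj, lam=1), minkowski(xi, xj, lam=2))   # lambda = 2 gives the Euclidean distance
print(mahalanobis(xi, xj, X))
```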