UNIT 1 Introduction of Data Mining
4. Cluster Analysis
Unlike classification and regression, which analyze class-labeled (training) data
sets, clustering analyzes data objects without consulting class labels. In many
cases, class-labeled data may simply not exist at the beginning. Clustering can be
used to generate class labels for a group of data. The objects are clustered or
grouped based on the principle of maximizing the intraclass similarity and
minimizing the interclass similarity. That is, clusters of objects are formed so that
objects within a cluster have high similarity in comparison to one another, but are
rather dissimilar to objects in other clusters. Each cluster so formed can be viewed
as a class of objects, from which rules can be derived. Clustering can also
facilitate taxonomy formation, that is, the organization of observations into a
hierarchy of classes that group similar events together.
Example 1.9
Cluster analysis. Cluster analysis can be performed on AllElectronics customer
data to identify homogeneous subpopulations of customers. These clusters may
represent individual target groups for marketing. Figure 1.10 shows a 2-D plot of
customers with respect to customer locations in a city. Three clusters of data
points are evident.
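The grouping described in Example 1.9 can be sketched with k-means, one common
clustering method. The notes do not name a specific algorithm, so the choice of
k-means, the farthest-first starting centroids, and the customer coordinates
below are all assumptions made for illustration.

```python
import math

def farthest_first(points, k):
    """Pick k starting centroids: the first point, then repeatedly the point
    farthest from every centroid chosen so far (deterministic, assumed here)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, max_iter=100):
    """Plain k-means on (x, y) tuples; returns (centroids, labels)."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the clusters are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels

# Hypothetical customer locations (x, y) in a city; no class labels are given.
customers = [
    (1.0, 1.2), (0.8, 0.9), (1.3, 1.1), (1.1, 0.7),
    (5.0, 5.2), (4.8, 4.9), (5.3, 5.1), (5.1, 4.7),
    (9.0, 1.2), (8.8, 0.9), (9.3, 1.1), (9.1, 0.7),
]
centroids, labels = kmeans(customers, k=3)
```

The resulting labels can serve as generated class labels: customers that share
a label are close to one another and far from customers in the other groups.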
5. Outlier Analysis
A data set may contain objects that do not comply with the general behavior or
model of the data. These data objects are outliers. Many data mining methods
discard outliers as noise or exceptions. However, in some applications (e.g., fraud
detection) the rare events can be more interesting than the more regularly
occurring ones. The analysis of outlier data is referred to as outlier analysis or
anomaly mining. Outliers may be detected using statistical tests that assume a
distribution or probability model for the data, or using distance measures where
objects that are far from every cluster are considered outliers. Density-based
methods take a different approach from statistical or distance measures: they
can identify objects that are outliers within a local region even though those
objects look normal from a global statistical distribution view.
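The distance-based approach can be sketched as follows. This is a minimal
illustration, not a definitive method: scoring each object by the distance to
its k-th nearest neighbor is one common choice, and the function names, the
sample points, and the threshold of 3.0 are assumptions.

```python
import math

def knn_distance_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor.
    A large score means the point is remote from the rest of the data."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

def distance_outliers(points, k=2, threshold=3.0):
    """Return the points whose k-th nearest neighbor is farther than threshold.
    The threshold is a hypothetical value chosen for this sample data."""
    scores = knn_distance_scores(points, k)
    return [p for p, s in zip(points, scores) if s > threshold]

# Two tight groups plus one remote object (hypothetical data).
points = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.3),
    (5.0, 5.0), (5.2, 4.8), (4.9, 5.3),
    (10.0, 10.0),
]
outliers = distance_outliers(points)
```

Only the object at (10.0, 10.0) is reported, because its second-nearest
neighbor lies about 7 units away while every other object has neighbors within
half a unit.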
Example 1.10
Outlier analysis. Outlier analysis may uncover fraudulent usage of
credit cards by detecting purchases of unusually large amounts for a given
account number in comparison to regular charges incurred by the same account.
Outlier values may also be detected with respect to the locations and types of
purchase, or the purchase frequency.
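A simple statistical version of this check compares each new charge with the
regular charges on the same account. The sketch below assumes a z-score test
against the account's own history; the account numbers, amounts, and the
threshold of 3 standard deviations are hypothetical.

```python
from statistics import mean, stdev

def is_unusual_charge(history, amount, z_threshold=3.0):
    """Flag amount if it lies more than z_threshold sample standard
    deviations above the mean of the account's past charges."""
    if len(history) < 2:
        return False  # too little history to estimate a spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return amount > mu  # all past charges identical
    return (amount - mu) / sigma > z_threshold

# Hypothetical past charges per account number.
history_by_account = {
    "ACCT-001": [20.0, 35.0, 25.0, 30.0, 40.0, 28.0],
    "ACCT-002": [900.0, 1100.0, 950.0, 1050.0, 1000.0],
}
new_charges = [("ACCT-001", 950.0), ("ACCT-002", 950.0), ("ACCT-001", 33.0)]

flagged = [
    (account, amount)
    for account, amount in new_charges
    if is_unusual_charge(history_by_account[account], amount)
]
```

The same amount of 950.0 is flagged for ACCT-001 but not for ACCT-002, which
shows why the comparison is made per account rather than against all charges.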