Data Analytics Full PDF
Fig. 2.24.1. Core, support, and boundaries of a fuzzy set (membership μ(x) plotted against x).
3. Boundaries:
a. The boundaries of a membership function for some fuzzy set A
are defined as that region of the universe containing elements that
have a non-zero membership but not complete membership.
b. The boundaries comprise those elements x of the universe such
that 0 < μA(x) < 1.
Answer
Fuzzy Inference :
1. Inference is a technique by which a goal G is derived from a given set
of facts F and premises P.
2. Fuzzy inference is the process of formulating the mapping fromagiven
input to an output using fuzzy logic.
3. The mapping then provides a basis from which decisions can be made.
4. Consider generalized modus ponens with a rule "IF x is A THEN y is B",
represented by a fuzzy relation R(x, y), and a given fact "x is A'";
the conclusion to be inferred is "y is B'".
5. The membership of B' is computed by composition as
B' = A' ∘ R(x, y)
6. In terms of membership functions :
μB'(y) = max over x of [min(μA'(x), μR(x, y))]
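The max-min composition above can be sketched numerically; the discrete universes, the fact A', and the relation R below are illustrative values, not taken from the text:

```python
import numpy as np

# Illustrative discrete universes: mu_A'(x) over X and the relation
# mu_R(x, y) encoding "IF x is A THEN y is B" (values are made up).
A_prime = np.array([0.2, 1.0, 0.3])
R = np.array([[0.1, 0.8, 0.4],
              [0.6, 1.0, 0.2],
              [0.3, 0.5, 0.9]])

# Max-min composition: mu_B'(y) = max over x of min(mu_A'(x), mu_R(x, y)).
B_prime = np.max(np.minimum(A_prime[:, None], R), axis=0)
print(B_prime)  # membership of the inferred conclusion "y is B'"
```

For each output value y, the minimum clips the relation by the given fact, and the maximum aggregates over all inputs x.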
Answer
1. Decision trees are one of the most popular methods for learning and
reasoning from instances.
2. Given a set of n input-output training patterns D = {(X^i, y^i) | i = 1, …, n},
where each training pattern X^i is described by a set of p conditional
(or input) attributes (x_1, …, x_p) and one corresponding discrete class label
y^i ∈ {1, …, q}, where q is the number of classes.
3. The decision attribute y^i represents a posterior knowledge regarding
the class of each pattern.
4. An arbitrary class is indexed by l (1 ≤ l ≤ q) and each class l is
modeled as a crisp set.
5. The membership degree of the ith value of the decision attribute y^i
concerning the lth class is defined as follows :
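The defining formula itself did not survive in this copy. Since each class l is modeled as a crisp set (point 4), the membership degree reduces to an indicator; the sketch below assumes that standard crisp definition (1 when y^i = l, 0 otherwise):

```python
# Assumed crisp definition: each class l (1 <= l <= q) is a crisp set, so
# the membership of pattern i's label y_i in class l is an indicator.
def class_membership(y_i: int, l: int) -> float:
    return 1.0 if y_i == l else 0.0

# Hypothetical labels for n = 4 training patterns with q = 3 classes.
labels = [1, 3, 2, 1]
memberships = [[class_membership(y, l) for l in range(1, 4)] for y in labels]
print(memberships)  # one row per pattern, one column per class
```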
Fig. : The fuzzy ID3 process — training data is fuzzy-partitioned and fed to fuzzy ID3 to produce fuzzy classification rules; test data is then classified by product-product-sum reasoning to obtain the classification accuracy.
Procedure:
While there exist candidate nodes DO
Select one of them using a search strategy.
Generate its child-nodes according to an expanded attribute obtained by
the given heuristic.
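The procedure can be sketched as a generic expansion loop; the breadth-first search strategy and the toy expand/is_leaf functions below are placeholder assumptions, standing in for the heuristic the text leaves abstract:

```python
from collections import deque

def grow_tree(root, expand, is_leaf):
    """While candidate nodes exist, select one (FIFO here) and generate
    its child nodes via the expanded attribute chosen by the heuristic."""
    candidates = deque([root])
    visited = []
    while candidates:                        # "while there exist candidate nodes"
        node = candidates.popleft()          # search strategy: breadth-first
        visited.append(node)
        if not is_leaf(node):
            candidates.extend(expand(node))  # heuristic supplies the children
    return visited

# Toy usage: expand integers into two children until a depth limit.
order = grow_tree(1, expand=lambda n: [n * 2, n * 2 + 1], is_leaf=lambda n: n >= 4)
print(order)
```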
2-26 J (CS-5/T-6) Data Analysis
Answer
1. Grid-based rule sets model each input variable through a usually small
set of linguistic values.
2. The resulting rule base uses all or a subset of all possible combinations
of these linguistic values for each variable, resulting in a global
granulation of the feature space into "tiles" :
(Figure: the linguistic values A_{i,j} of the input variables granulate the feature space into tiles R_{i,j}.)
Each tile defines one rule :
R_{j1,…,jn} : IF x_1 IS msx_{1,j1} AND … AND x_n IS msx_{n,jn} THEN y IS msy_{j1,…,jn}
assuming that tile (j_1, …, j_n) resulted in the highest degree of
membership for the corresponding training pattern.
3. Assign a rule weight to each rule : the degree of membership will in
addition be assigned to each rule as rule-weight β.
4. Determine an output based on an input vector : given an input x,
the resulting rule base can be used to compute a crisp output ŷ.
First the degree of fulfillment for each rule is computed :
μ_{j1,…,jn}(x) = min(msx_{1,j1}(x_1), …, msx_{n,jn}(x_n))
then the output ŷ is combined through a centroid defuzzification
formula :
ŷ = [Σ_{j=1}^{m} β_j · μ_j(x) · ȳ_j] / [Σ_{j=1}^{m} β_j · μ_j(x)], j = 1, …, m
where m is the number of rules and ȳ_j is the centroid of the consequent fuzzy set msy_j.
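A minimal one-input sketch of the two steps above (min degree of fulfillment, then weighted-centroid defuzzification); the triangular membership functions, rule weights, and output centroids are made-up illustrative values:

```python
import numpy as np

def triangle(x, a, b, c):
    """Triangular membership function with peak at b (an illustrative shape)."""
    return np.maximum(0.0, np.minimum((x - a) / (b - a), (c - x) / (c - b)))

# Hypothetical 1-D rule base: three linguistic values msx_j on x, each rule j
# carrying a rule-weight beta_j and an output centroid ybar_j (all made up).
msx = [(-0.5, 0.0, 0.5), (0.0, 0.5, 1.0), (0.5, 1.0, 1.5)]
beta = np.array([1.0, 1.0, 1.0])
ybar = np.array([0.0, 5.0, 10.0])

def predict(x):
    # Degree of fulfillment per rule; with one input the min over
    # antecedents reduces to the single membership value.
    mu = np.array([triangle(x, a, b, c) for a, b, c in msx])
    # Weighted-centroid defuzzification.
    return float(np.sum(beta * mu * ybar) / np.sum(beta * mu))

print(predict(0.5), predict(0.25))
```

At x = 0.5 only the middle rule fires, so the output is exactly its centroid; at x = 0.25 two rules fire equally and the output is their weighted average.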
2. Fast processing time.
PART-4
Clustering High Dimensional Data, CLIQUE
and PROCLUS, Frequent Pattern Based Clustering Methods.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Answer
Approaches for high dimensional data clustering are :
1. Subspace clustering:
a. Subspace clustering algorithms localize the search for relevant
dimensions, allowing them to find clusters that exist in multiple,
and possibly overlapping, subspaces.
b. This technique is an extension of feature selection that attempts
to find clusters in different subspaces of the same dataset.
c. Subspace clustering requires a search method and evaluation
criteria.
2. Projected clustering :
a. In high-dimensional spaces, even though a good partition cannot
be defined on all the dimensions because of the sparsity of the
data, some subset of the dimensions can always be obtained on
which some subsets of data form high quality and significant
clusters.
b. Projected clustering methods aim to find clusters specific to
a particular group of dimensions. Each cluster may refer to a
different subset of dimensions.
c. The output of a typical projected clustering algorithm, searching
for k clusters in subspaces of dimension l, is twofold :
i. A partition of the data into k + 1 different clusters, where the first
k clusters are well shaped, while the (k + 1)th cluster's elements
are outliers, which by definition do not cluster well.
ii. A possibly different set of l dimensions for each of the first k
clusters, such that the points in each of those clusters are
well clustered in the subspaces defined by these vectors.
3. Biclustering :
a. Biclustering (or two-way clustering) is a methodology allowing for
feature set and data points clustering simultaneously, i.e., to find
clusters of samples possessing similar characteristics together with
features creating these similarities.
b. The output of biclustering is not a partition or hierarchy of
partitions of either rows or columns, but a partition of the whole
matrix into sub-matrices or patches.
c. The goal of biclustering is to find as many patches as possible,
and to have them as large as possible, while maintaining strong
homogeneity within patches.
Answer
1. CLIQUE is a subspace clustering method.
2. CLIQUE (CLustering In QUEst) is a simple grid-based method for
finding density based clusters in subspaces.
3. CLIQUE performs clustering in two steps :
a. First step :
i. CLIQUE partitions each dimension into non-overlapping intervals,
thereby partitioning the entire embedding space of the data objects
into cells.
ii. It uses a density threshold to identify dense cells and sparse ones.
b. Second step:
i. In the second step, CLIQUE uses the dense cells in each subspace
to assemble clusters, which can be of arbitrary shape.
ii. The idea is to apply the Minimum Description Length (MDL)
principle to use the maximal regions to cover connected dense
cells, where a maximal region is a hyper rectangle where every
cell falling into this region is dense, and the region cannot be
extended further in any dimension in the subspace.
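The first step (grid partitioning plus a density threshold) can be sketched in 2-D as follows; the interval count `xi`, threshold `tau`, and sample points are illustrative assumptions, not values from the text:

```python
from collections import Counter

# Sketch of CLIQUE's first step in 2-D: split each dimension into xi equal
# intervals and keep cells holding at least tau points (parameters made up).
def dense_cells(points, xi, tau, lo=0.0, hi=1.0):
    width = (hi - lo) / xi
    counts = Counter()
    for x, y in points:
        # Map each point to its grid cell; clamp points at the upper edge.
        cell = (min(int((x - lo) / width), xi - 1),
                min(int((y - lo) / width), xi - 1))
        counts[cell] += 1
    return {cell for cell, n in counts.items() if n >= tau}

pts = [(0.1, 0.1), (0.15, 0.12), (0.18, 0.05), (0.9, 0.9)]
print(dense_cells(pts, xi=5, tau=3))  # only the cell containing 3 points is dense
```

The second step would then join connected dense cells into clusters; that part is omitted here.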
Answer
1. Projected clustering (PROCLUS) is a top-down subspace clustering
algorithm.
2. PROCLUS samples the data and then selects a set of k medoids and
iteratively improves the clustering.
3. PROCLUS is actually faster than CLIQUE due to the sampling of large
data sets.
Answer
Basic subspace clustering approaches are :
1. Grid-based subspace clustering :
a. In this approach, the data space is divided into axis-parallel cells.
Then the cells containing objects above a predefined threshold value
given as a parameter are merged to form subspace clusters. The number
of intervals is another input parameter which defines the range of
values in each grid.
b. Apriori property is used to prune non-promising cells and to
improve efficiency.
c. If a unit is found to be dense in k − 1 dimensions, then it is considered
for finding dense units in k dimensions.
d. If grid boundaries are strictly followed to separate objects, the accuracy
of the clustering result is decreased, as it may miss neighbouring
objects which get separated by a strict grid boundary. Clustering
quality is highly dependent on input parameters.
2. Window-based subspace clustering :
a. Window-based subspace clustering overcomes drawbacks of
cell-based subspace clustering that it may omit significant results.
b. Here a window slides across attribute values and obtains
overlapping intervals to be used to form subspace clusters.
c. The size of the sliding window is one of the parameters. These
algorithms generate axis-parallel subspace clusters.
3. Density-based subspace clustering:
a. Density-based subspace clustering overcomes the drawbacks of grid-
based subspace clustering algorithms by not using grids.
b. A cluster is defined as a collection of objects forming a chain which
fall within a given distance and exceed a predefined threshold of
object count. Then adjacent dense regions are merged to form
bigger clusters.
C. As no grids are used, these algorithms can find arbitrarily shaped
subspace clusters.
d. Clusters are built by joining together the objects from adjacent
dense regions.
e. These approaches are sensitive to the values of the distance parameters.
f. The effect of the curse of dimensionality is overcome in density-based
algorithms by utilizing a density measure which is adaptive to the
subspace size.
Answer
The major tasks of clustering evaluation include the following :
1. Assessing clustering tendency :
a. In this task, for a given data set, we assess whether a non-random
structure exists in the data.
2. Determining the number of clusters in a data set :
a. A few algorithms, such as k-means, require the number of clusters
in a data set as a parameter.
b. Moreover, the number of clusters can be regarded as an interesting
and important summary statistic of a data set.
c. Therefore, it is desirable to estimate this number even before a
clustering algorithm is used to derive detailed clusters.
PART-5
Clustering in Non-Euclidean Space, Clustering for
Streams and Parallelism.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Answer
1. The representation of a cluster in main memory consists of several
features.
2. Before listing these features, if p is any point in a cluster, let
ROWSUM(p) be the sum of the squares of the distances from p to each
of the other points in the cluster.
3. The following features form the representation of a cluster.
a. N, the number of points in the cluster.
b. The clustroid of the cluster, which is defined specifically to be the
point in the cluster that minimizes the sum of the squares of the
distances to the other points; that is, the clustroid is the point in
the cluster with the smallest ROWSUM.
c. The rowsum of the clustroid of the cluster.
d. For some chosen constant k, the k points of the cluster that are
closest to the clustroid, and their rowsums. These points are part
of the representation in case the addition of points to the cluster
causes the clustroid to change. The assumption is made that the
new clustroid would be one of these k points near the old clustroid.
Data Analytics 4-27 J (CS-5/1T-6)
e. The k points of the cluster that are furthest from the clustroid and
their rowsums. These points are part of the representation so that
we can consider whether two clusters are close enough to merge.
The assumption is made that if two clusters are close, then a pair
of points distant from their respective clustroids would be close.
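The ROWSUM and clustroid definitions above can be sketched directly; the sample cluster and the use of squared Euclidean distance in 2-D are illustrative assumptions:

```python
# ROWSUM(p): sum of squared distances from p to the other points; the
# clustroid is the point with the smallest ROWSUM (definitions above).
def rowsum(p, cluster):
    return sum((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 for q in cluster)

def clustroid(cluster):
    return min(cluster, key=lambda p: rowsum(p, cluster))

# Made-up 2-D cluster with one point far from the rest.
cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
c = clustroid(cluster)
print(c, rowsum(c, cluster))
```

Note that the distant point (5, 5) inflates every rowsum but can never be the clustroid, since its own rowsum is the largest.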
Answer
1. The clusters are organizedinto a tree, and the nodes of the tree may be
very large, perhaps disk blocks or pages, as in the case of a B-tree,
which the cluster-representing tree resembles.
Answer
1. In the BDMO algorithm, the points of the stream are partitioned into
buckets whose sizes are a power of two. Here, the size of a bucket is
the number of points it represents, rather than the number of stream
elements that are 1.
2. The sizes of buckets obey the restriction that there are one or two of
each size, up to some limit. They are required only to form a sequence
where each size is twice the previous size, such as 3, 6, 12, 24, ….
3. The contents of a bucket consist of:
a. The size of the bucket.
b. The timestamp of the bucket, that is, the most recent point that
contributes to the bucket.
c. A collection of records that represent the clusters into which the
points of that bucket have been partitioned. These records contain :
i. The number of points in the cluster.