
Fig. 2.24.1. Core, support, and boundaries of a fuzzy set.
3. Boundaries:
a. The boundaries of a membership function for some fuzzy set A
are defined as that region of the universe containing elements that
have a non-zero membership but not complete membership.
b. The boundaries comprise those elements x of the universe such
that 0< H (x) < 1.

Que 2.25. Explain the inference in fuzzy logic.

Answer
Fuzzy Inference:
1. Inference is a technique by which, from a given set of facts and
premises (F, P), a goal G is derived.
2. Fuzzy inference is the process of formulating the mapping from a given
input to an output using fuzzy logic.
3. The mapping then provides a basis from which decisions can be made.

4. Fuzzy inference (approximate reasoning) refers to computational
procedures used for evaluating linguistic (IF-THEN) descriptions.
5. The two important inferring procedures are:
i. Generalized Modus Ponens (GMP):
1. GMP is formally stated as:
   IF x is A THEN y is B
   x is A'
   -------------------
   y is B'
   Here, A, B, A' and B' are fuzzy terms.
2. Every fuzzy linguistic statement above the line is analytically known
and what is below the line is analytically unknown.
   Here, B' = A' ∘ R(x, y)
where '∘' denotes the max-min composition and R(x, y) is the IF-THEN
(implication) relation.
3. The membership function of B' is computed as
   μB'(y) = max_x [min (μA'(x), μR(x, y))]
where μB'(y) is the membership function of B', μA'(x) is the membership
function of A' and μR(x, y) is the membership function of the
implication relation.
ii. Generalized Modus Tollens (GMT):
1. GMT is formally stated as:
   IF x is A THEN y is B
   y is B'
   -------------------
   x is A'
2. The membership of A' is computed as
   A' = B' ∘ R(x, y)
3. In terms of membership functions,
   μA'(x) = max_y [min (μB'(y), μR(x, y))]
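
On finite universes both GMP and GMT reduce to max-min compositions over arrays. The following Python sketch applies GMP to small discrete fuzzy sets; the concrete sets and the use of the Mamdani (min) rule to build R(x, y) are illustrative assumptions:

    import numpy as np

    # A, B define the rule; A_prime is the observed fact. The implication
    # relation R is built here with the Mamdani (min) rule -- one common
    # choice, not the only possible implication.
    A = np.array([0.2, 0.8, 1.0, 0.4])        # IF x is A ...
    B = np.array([0.3, 1.0, 0.6])             # ... THEN y is B
    A_prime = np.array([0.1, 0.6, 1.0, 0.5])  # fact: x is A'

    R = np.minimum.outer(A, B)                # R(x, y) = min(A(x), B(y))

    # GMP: mu_B'(y) = max_x min(mu_A'(x), mu_R(x, y))
    B_prime = np.max(np.minimum(A_prime[:, None], R), axis=0)
    print(B_prime)                            # inferred conclusion: y is B'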

Que 2.26. Explain Fuzzy Decision Tree (FDT).

Answer
1. Decision trees are one of the most popular methods for learning and
reasoning from instances.
2. Given a set of n input-output training patterns D = {(X^i, y^i) | i =
1, ..., n}, where each training pattern X^i is described by a set of p
conditional (or input) attributes (x_1, ..., x_p) and one corresponding
discrete class label y^i, where y^i ∈ {1, ..., q} and q is the number
of classes.
3. The decision attribute y^i represents a posterior knowledge regarding
the class of each pattern.
4. An arbitrary class is indexed by l (1 ≤ l ≤ q) and each class l is
modeled as a crisp set.
5. The membership degree of the i-th value of the decision attribute y^i
concerning the l-th class is defined as follows:
   μ_l(y^i) = 1, if y^i belongs to the l-th class;
              0, otherwise.
6. The architecture of induction of FDT is given in Fig. 2.26.1.

Fig. 2.26.1. Architecture of fuzzy decision tree induction: training
data → fuzzy partitioning → Fuzzy ID3 → fuzzy classification rules;
test data → product-product-sum reasoning → estimated class label,
which is compared with the actual class label to give classification
accuracy.
7. The generation of FDT for pattern classification consists of three major
steps, namely fuzzy partitioning (clustering), induction of FDT, and
fuzzy rule inference for classification.
8. The first crucial step in the induction process of FDT is the fuzzy
partitioning of input space using any fuzzy clustering technique.
9. FDTs are constructed using any standard algorithm like Fuzzy ID3,
where we follow a top-down, recursive divide-and-conquer approach,
which makes locally optimal decisions at each node.
10. As the tree is being built, the training set is recursively partitioned into
smaller subsets, and the generated fuzzy rules are used to predict the
class of an unseen pattern by applying a suitable fuzzy inference/reasoning
mechanism on the FDT.
11. The general procedure for generating fuzzy decision trees using Fuzzy
ID3 is as follows (a sketch of the attribute-selection step follows the
procedure):
Prerequisites: A fuzzy partition space, leaf selection threshold β, and
the best node selection criterion.

Procedure:
While there exist candidate nodes
  Select one of them using a search strategy.
  Generate its child nodes according to an expanded attribute obtained
  by the given heuristic.
  Check child nodes for the leaf selection threshold.
  Child nodes meeting the leaf threshold are terminated as leaf nodes.
  The remaining child nodes are regarded as new candidate nodes.
end
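
The attribute-selection heuristic referred to above can be sketched as follows. This is a minimal illustration that assumes fuzzy entropy as the "best node selection criterion"; the membership-matrix layout is a modelling assumption, not prescribed by the procedure:

    import numpy as np

    def fuzzy_entropy(class_memberships):
        """Fuzzy entropy of a node, computed from the per-class sums of
        membership degrees (rows = patterns, columns = classes)."""
        totals = class_memberships.sum(axis=0)
        p = totals / totals.sum()
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    def best_attribute(node_mu, attr_mus, class_mu):
        """Choose the expanded attribute: the one whose fuzzy partition
        minimizes the membership-weighted entropy of the child nodes.
        node_mu  : (n,)   membership of each pattern in the current node
        attr_mus : list of (n, k) matrices, one per candidate attribute
        class_mu : (n, q) membership of each pattern in each class"""
        scores = []
        for mus in attr_mus:
            child_entropy = 0.0
            for term in mus.T:                        # one fuzzy term at a time
                child_mu = np.minimum(node_mu, term)  # membership in child node
                if child_mu.sum() == 0:
                    continue                          # empty child, skip
                w = child_mu.sum() / node_mu.sum()
                child_entropy += w * fuzzy_entropy(child_mu[:, None] * class_mu)
            scores.append(child_entropy)
        return int(np.argmin(scores))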

Que 2.27. Write short notes on extracting grid-based fuzzy models
from data.

Answer
1. Grid-based rule sets model each input variable through a usually small
set of linguistic values.
2. The resulting rule base uses all or a subset of all possible combinations
of these linguistic values for each variable, resulting in a global
granulation of the feature space into "tiles":

R_1 : IF x_1 IS A_1,1 AND ... AND x_n IS A_n,1 THEN ...
...
R_m : IF x_1 IS A_1,l1 AND ... AND x_n IS A_n,ln THEN ...

Fig. 2.27.1. A global granulation of the input space using
three membership functions for x_1 and two for x_2.

where l_i (1 ≤ i ≤ n) indicates the number of linguistic values for
variable i in the n-dimensional feature space. Fig. 2.27.1 illustrates this
approach in two dimensions with l_1 = 3 and l_2 = 2.
3. Extracting grid-based fuzzy models from data is straightforward when
the input granulation is fixed, that is, the antecedents of all rules are
predefined. Then only a matching consequent for each rule needs to be
found.
4. After predefinition of the granulation of all input variables and also the
output variable, one sweep through the entire dataset determines the
closest example to the geometrical center of each rule, assigning the
closest output fuzzy value to the corresponding rule:

1. Granulate the input and output space:
a. Divide each variable x_i into l_i equidistant triangular membership
functions.
b. Similarly, the granulation into l_y overlapping membership functions
for the output variable y is determined, resulting in the typical
distribution of triangular membership functions.
c. Fig. 2.27.1 illustrates this approach in two dimensions with respect
to membership functions, resulting in six tiles.
2. Generate fuzzy rules from given data:
a. For the example in Fig. 2.27.1, this means that we have to determine
the best consequence for each rule.
b. For each example pattern (x, y), the degree of membership to each
of the possible tiles is determined:
   min (msx_1,j1(x_1), ..., msx_n,jn(x_n), msy_jy(y))
with 1 ≤ j_i ≤ l_i and 1 ≤ j_y ≤ l_y. Here msx_i,j indicates the
membership function of the j-th linguistic value of input variable i,
and similarly msy for the output variable y. Next, the tile resulting
in the maximum degree of membership is used to generate one rule:
   R : IF x_1 IS msx_1,j1 AND ... AND x_n IS msx_n,jn THEN y IS msy_jy
assuming that tile (j_1, ..., j_n, j_y) resulted in the highest degree of
membership for the corresponding training pattern.
3. Assign a rule weight to each rule: The degree of membership will in
addition be assigned to each rule as rule weight β.
4. Determine an output based on an input vector: Given an input x,
the resulting rule base can be used to compute a crisp output ŷ. First
the degree of fulfillment of each rule is computed:
   μ_j1,...,jn(x) = min (msx_1,j1(x_1), ..., msx_n,jn(x_n))
then the output ŷ is computed through centroid defuzzification:
   ŷ = [ Σ β_j1,...,jn · μ_j1,...,jn(x) · ȳ_j1,...,jn ] / [ Σ β_j1,...,jn · μ_j1,...,jn(x) ]
where both sums run over all rules (j_1, ..., j_n) and ȳ_j1,...,jn denotes
the center of the output region of the corresponding rule with index
(j_1, ..., j_n).
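
A minimal end-to-end sketch of this procedure in Python is given below. It assumes all variables are scaled to [0, 1], one common grid size for all inputs, and equidistant triangular memberships; these are simplifying assumptions for illustration, not part of the method's definition:

    import numpy as np

    def tri_memberships(v, k):
        """Memberships of a value v in [0, 1] to k equidistant triangular
        fuzzy sets with centers at 0, 1/(k-1), ..., 1."""
        centers = np.linspace(0.0, 1.0, k)
        width = 1.0 / (k - 1)
        return np.maximum(0.0, 1.0 - np.abs(v - centers) / width)

    def extract_rules(X, y, l_in, l_out):
        """One candidate rule per training pattern: the tile with maximal
        membership wins; its degree is kept as the rule weight beta."""
        rules = {}
        for x, t in zip(X, y):
            ant = tuple(int(np.argmax(tri_memberships(v, l_in))) for v in x)
            cons = int(np.argmax(tri_memberships(t, l_out)))
            deg = min(min(tri_memberships(v, l_in)[j] for v, j in zip(x, ant)),
                      tri_memberships(t, l_out)[cons])
            if ant not in rules or deg > rules[ant][1]:
                rules[ant] = (cons, deg)      # strongest rule per antecedent tile
        return rules

    def predict(x, rules, l_in, l_out):
        """Crisp output via centroid defuzzification over all rules."""
        centers = np.linspace(0.0, 1.0, l_out)   # centers of output regions
        num = den = 0.0
        for ant, (cons, beta) in rules.items():
            mu = min(tri_memberships(v, l_in)[j] for v, j in zip(x, ant))
            num += beta * mu * centers[cons]
            den += beta * mu
        return num / den if den > 0 else 0.5     # fallback if no rule fires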
Characteristics of hierarchical methods:
1. Clustering is a hierarchical decomposition (i.e., multiple levels).
2. Cannot correct erroneous merges or splits.
3. May incorporate other techniques like micro-clustering or consider
object "linkages".
Characteristics of density-based methods:
1. Can find arbitrarily shaped clusters.
2. Clusters are dense regions of objects in space that are separated by
low-density regions.
3. May filter out outliers.
Characteristics of grid-based methods:
1. Use a multiresolution grid data structure.
2. Fast processing time.

PART-4
Clustering High Dimensional Data, CLIQUE
and PROCLUS, Frequent Pattern Based Clustering Methods.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 4.23. What are the approaches for high dimensional data
clustering ?

Answer
Approaches for high dimensional data clustering are:
1. Subspace clustering:
a. Subspace clustering algorithms localize the search for relevant
dimensions, allowing them to find clusters that exist in multiple,
and possibly overlapping, subspaces.
b. This technique is an extension of feature selection that attempts
to find clusters in different subspaces of the same dataset.
c. Subspace clustering requires a search method and evaluation
criteria.
d. It limits the scope of the evaluation criteria so as to consider
different subspaces for each different cluster.

2. Projected clustering:
a. In high-dimensional spaces, even though a good partition cannot
be defined on all the dimensions because of the sparsity of the
data, some subset of the dimensions can always be obtained on
which some subsets of data form high quality and significant
clusters.
b. Projected clustering methods aim to find clusters specific to
a particular group of dimensions. Each cluster may refer to
different subsets of dimensions.
c. The output of a typical projected clustering algorithm, searching
for k clusters in subspaces of dimension l, is twofold:
i. A partition of the data into k + 1 different clusters, where the first
k clusters are well shaped, while the (k + 1)-th cluster's elements
are outliers, which by definition do not cluster well.
ii. A possibly different set of l dimensions for each of the first k
clusters, such that the points in each of those clusters are
well clustered in the subspaces defined by these vectors.
3. Biclustering:
a. Biclustering (or two-way clustering) is a methodology allowing
the feature set and data points to be clustered simultaneously,
i.e., to find clusters of samples possessing similar characteristics
together with the features creating these similarities.
b. The output of biclustering is not a partition or hierarchy of
partitions of either rows or columns, but a partition of the whole
matrix into sub-matrices or patches.
c. The goal of biclustering is to find as many patches as possible, and
to have them as large as possible, while maintaining strong
homogeneity within patches.

Que 4.24. Write short note on CLIQUE.

Answer
1. CLIQUE is a subspace clustering method.
2. CLIQUE (CLustering In QUEst) is a simple grid-based method for
finding density-based clusters in subspaces.
3. CLIQUE partitions each dimension into non-overlapping intervals,
thereby partitioning the entire embedding space of the data objects
into cells. It uses a density threshold to identify dense cells and sparse
ones.
4. A cell is dense if the number of objects mapped to it exceeds the density
threshold.
5. The main strategy behind CLIQUE for identifying a candidate search
space uses the monotonicity of dense cells with respect to dimensionality.
This is based on the Apriori property used in frequent pattern and
association rule mining.
6. In the context of clusters in subspaces, the monotonicity says the
following: A k-dimensional cell c (k > 1) can have at least l points only
if every (k − 1)-dimensional projection of c, which is a cell in a
(k − 1)-dimensional subspace, has at least l points.
7. CLIQUE performs clustering in the following two steps:
a. First step:
i. In the first step, CLIQUE partitions the d-dimensional data space
into non-overlapping rectangular units, identifying the dense units
among these.
ii. CLIQUE finds dense cells in all of the subspaces.
iii. To do so, CLIQUE partitions every dimension into intervals, and
identifies intervals containing at least l points, where l is the
density threshold.
iv. CLIQUE then iteratively joins two k-dimensional dense cells, c_1
and c_2, in subspaces (D_i1, ..., D_ik) and (D_j1, ..., D_jk),
respectively, if D_i1 = D_j1, ..., D_ik−1 = D_jk−1, and c_1 and c_2
share the same intervals in those dimensions. The join operation
generates a new (k + 1)-dimensional candidate cell c in space
(D_i1, ..., D_ik−1, D_ik, D_jk).
v. CLIQUE checks whether the number of points in c passes the
density threshold. The iteration terminates when no candidates
can be generated or no candidate cells are dense.

b. Second step:
i. In the second step, CLIQUE uses the dense cells in each subspace
to assemble clusters, which can be of arbitrary shape.
ii. The idea is to apply the Minimum Description Length (MDL)
principle to use the maximal regions to cover connected dense
cells, where a maximal region is a hyper rectangle where every
cell falling into this region is dense, and the region cannot be
extended further in any dimension in the subspace.
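
The first step can be sketched compactly in Python: count points per one-dimensional interval, keep the dense cells, and join dense cells Apriori-style. The grid resolution, the [0, 1) data range and the density threshold below are illustrative assumptions:

    from collections import Counter
    from itertools import combinations

    def dense_cells_1d(points, n_intervals, threshold):
        """1-D dense cells; each cell is a tuple of (dimension, interval)
        pairs. Points are assumed to lie in [0, 1)^d."""
        counts = Counter()
        for p in points:
            for d, v in enumerate(p):
                counts[(d, int(v * n_intervals))] += 1
        return {((d, i),) for (d, i), n in counts.items() if n >= threshold}

    def join_dense_cells(dense_k, points, n_intervals, threshold, k):
        """Apriori-style join of k-dimensional dense cells into dense
        (k+1)-dimensional cells (full subset pruning omitted for brevity)."""
        dense_next = set()
        for c1, c2 in combinations(dense_k, 2):
            cand = tuple(sorted(set(c1) | set(c2)))
            dims = [d for d, _ in cand]
            if len(cand) != k + 1 or len(set(dims)) != k + 1:
                continue  # cells must differ in exactly one dimension
            support = sum(all(int(p[d] * n_intervals) == i for d, i in cand)
                          for p in points)
            if support >= threshold:
                dense_next.add(cand)
        return dense_next

Starting from dense_cells_1d, repeated calls to join_dense_cells enumerate dense cells of increasing dimensionality until no new dense cells appear, mirroring the first step above.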

Que 4.25. Write short notes on PROCLUS.

Answer
1. Projected clustering (PROCLUS) is a top-down subspace clustering
algorithm.
2. PROCLUS samples the data and then selects a set of k medoids and
iteratively improves the clustering.
3. PROCLUS is actually faster than CLIQUE due to the sampling of large
data sets.

4. The three phases of PROCLUS are as follows:
a. Initialization phase: Select a set of potential medoids that are
far apart using a greedy algorithm.
b. Iteration phase:
i. Select a random set of k medoids from this reduced data set to
determine if clustering quality improves by replacing current
medoids with randomly chosen new medoids.
ii. Cluster quality is based on the average distance between
instances and the nearest medoid.
iii. For each medoid, a set of dimensions is chosen whose average
distances are small compared to statistical expectation.
iv. Once the subspaces have been selected for each medoid, average
Manhattan segmental distance is used to assign points to
medoids, forming clusters (see the sketch after this list).
c. Refinement phase:
i. Compute a new list of relevant dimensions for each medoid
based on the clusters formed and reassign points to medoids,
removing outliers.
ii. The distance-based approach of PROCLUS is biased toward
clusters that are hyper-spherical in shape.
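
The Manhattan segmental distance used in the iteration phase averages the per-dimension distance over only the dimensions relevant to a medoid. A minimal Python sketch (the data layout and per-medoid dimension sets are illustrative assumptions):

    def manhattan_segmental(p, medoid, dims):
        """Manhattan distance averaged over the relevant dimensions only."""
        return sum(abs(p[d] - medoid[d]) for d in dims) / len(dims)

    def assign(points, medoids, medoid_dims):
        """Assign each point to the medoid with the smallest segmental
        distance; medoid_dims[m] is the dimension set chosen for medoid m."""
        clusters = {m: [] for m in range(len(medoids))}
        for p in points:
            best = min(range(len(medoids)),
                       key=lambda m: manhattan_segmental(p, medoids[m],
                                                         medoid_dims[m]))
            clusters[best].append(p)
        return clusters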
Que 4.26. Discuss the basic subspace clustering approaches.

Answer
Basic subspace clustering approaches are:
1. Grid-based subspace clustering:
a. In this approach, the data space is divided into axis-parallel cells.
Then the cells containing objects above a predefined threshold value,
given as a parameter, are merged to form subspace clusters. The
number of intervals is another input parameter, which defines the
range of values in each grid cell.
b. The Apriori property is used to prune non-promising cells and to
improve efficiency.
c. If a unit is found to be dense in k − 1 dimensions, then it is considered
for finding dense units in k dimensions.
d. If grid boundaries are strictly followed to separate objects, the
accuracy of the clustering result is decreased, as it may miss
neighbouring objects which get separated by a strict grid boundary.
Clustering quality is highly dependent on the input parameters.
2. Window-based subspace clustering:
a. Window-based subspace clustering overcomes the drawback of
cell-based subspace clustering that it may omit significant results.
b. Here a window slides across attribute values and obtains
overlapping intervals to be used to form subspace clusters.
c. The size of the sliding window is one of the parameters. These
algorithms generate axis-parallel subspace clusters.
3. Density-based subspace clustering:
a. Density-based subspace clustering overcomes the drawbacks of
grid-based subspace clustering algorithms by not using grids.
b. A cluster is defined as a collection of objects forming a chain which
fall within a given distance and exceed a predefined threshold of
object count. Then adjacent dense regions are merged to form
bigger clusters.
c. As no grids are used, these algorithms can find arbitrarily shaped
subspace clusters.
d. Clusters are built by joining together the objects from adjacent
dense regions.
e. These approaches are sensitive to the values of the distance
parameters.
f. The effect of the curse of dimensionality is overcome in density-based
algorithms by utilizing a density measure which is adaptive to the
subspace size.

Que 4.27. What are the major tasks of clustering evaluation ?

Answer
The major tasks of clustering evaluation include the following:
1. Assessing clustering tendency:
a. In this task, for a given data set, we assess whether a non-random
structure exists in the data.
b. Blindly applying a clustering method on a data set will return
clusters; however, the clusters mined may be misleading.
c. Clustering analysis on a data set is meaningful only when there is
a non-random structure in the data.

2. Determining the number of clusters in a data set:
a. A few algorithms, such as k-means, require the number of clusters
in a data set as a parameter.
b. Moreover, the number of clusters can be regarded as an interesting
and important summary statistic of a data set.
c. Therefore, it is desirable to estimate this number even before a
clustering algorithm is used to derive detailed clusters.
3. Measuring clustering quality:
a. After applying a clustering method on a data set, we want to
assess how good the resulting clusters are.
b. A number of measures can be used.
c. Some methods measure how well the clusters fit the data set,
while others measure how well the clusters match the ground
truth, if such truth is available.
d. There are also measures that score clusterings and thus can
compare two sets of clustering results on the same data set.
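
As an example of a measure that scores a clustering without ground truth, the silhouette coefficient is commonly used; a small illustration with scikit-learn follows (the synthetic data and k = 3 are illustrative assumptions):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # Silhouette ranges over [-1, 1]; values near 1 indicate compact,
    # well-separated clusters.
    print("silhouette:", silhouette_score(X, labels))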

PART-5
Clustering in Non-Euclidean Space, Clustering for
Streams and Parallelism.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 4.28. Explain representation of clusters in GRGPF algorithm.

Answer
1. The representation of a cluster in main memory consists of several
features; a small computational sketch follows the list below.
2. Before listing these features, if p is any point in a cluster, let
ROWSUM(p) be the sum of the squares of the distances from p to each
of the other points in the cluster.
3. The following features form the representation of a cluster.
a. N, the number of points in the cluster.
b. The clustroid of the cluster, which is defined specifically to be the
point in the cluster that minimizes the sum of the squares of the
distances to the other points; that is, the clustroid is the point in
the cluster with the smallest ROWSUM.
C. The rowsum of the clustroid of the cluster.
d. For some chosen constant k, the k points of the cluster that are
closest to the clustroid, and their rowsums. These points are part
of the representation in case the addition of points to the cluster
causes the clustroid to change. The assumption is made that the
new clustroid would be one of these k points near the old clustroid.
e. The k points of the cluster that are furthest from the clustroid and
their rowsums. These points are part of the representation so that
we can consider whether two clusters are close enough to merge.
The assumption is made that if two clusters are close, then a pair
of points distant from their respective clustroids would be close.
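
For a small in-memory cluster, the clustroid, its ROWSUM and the k nearest and furthest points can be computed directly; the sketch below uses squared Euclidean distances, and the sample data layout is an illustrative assumption:

    import numpy as np

    def rowsums(points):
        """ROWSUM(p): sum of squared distances from p to the other points."""
        diffs = points[:, None, :] - points[None, :, :]
        return (diffs ** 2).sum(axis=(1, 2))

    def cluster_features(points, k=2):
        """N, the clustroid, its rowsum, and the k nearest and furthest
        points (with their rowsums), as in the representation above."""
        rs = rowsums(points)
        c = int(np.argmin(rs))                        # smallest ROWSUM
        d = ((points - points[c]) ** 2).sum(axis=1)   # distances to clustroid
        order = np.argsort(d)
        return {"N": len(points),
                "clustroid": points[c], "rowsum": rs[c],
                "nearest": points[order[1:k + 1]],    # order[0] is the clustroid
                "nearest_rowsums": rs[order[1:k + 1]],
                "furthest": points[order[-k:]],
                "furthest_rowsums": rs[order[-k:]]}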

Que 4.29. Explain initialization of the cluster tree in the GRGPF
algorithm.
Answer
1. The clusters are organized into a tree, and the nodes of the tree may be
very large, perhaps disk blocks or pages, as in the case of a B-tree,
which the cluster-representing tree resembles.
2. Each leaf of the tree holds as many cluster representations as can fit.
3. A cluster representation has a size that does not depend on the number
of points in the cluster.
4. An interior node of the cluster tree holds a sample of the clustroids of
the clusters represented by each of its subtrees, along with pointers to
the roots of those subtrees.
5. The samples are of fixed size, so the number of children that an interior
node may have is independent of its level.
6. As we go up the tree, the probability that a given cluster's clustroid is
part of the sample diminishes.
7. We initialize the cluster tree by taking a main-memory sample of the
dataset and clustering it hierarchically.
8. The result of this clustering is a tree T, but T is not exactly the tree
used by the GRGPF Algorithm. Rather, we select from T certain of its
nodes that represent clusters of approximately some desired size n.
9. These are the initial clusters for the GRGPF Algorithm, and we place
their representations at the leaves of the cluster-representing tree. We
then group clusters with a common ancestor in T into interior nodes of
the cluster-representing tree. In some cases, rebalancing of the
cluster-representing tree will be necessary.

Que 4.30. Write short note on the BDMO stream clustering algorithm.

Answer
1. In the BDMO algorithm, the points of the stream are partitioned into,
and summarized by, buckets whose sizes are a power of two. Here, the
size of a bucket is the number of points it represents, rather than the
number of stream elements that are 1.
2. The sizes of buckets obey the restriction that there are one or two of
each size, up to some limit. They are required only to form a sequence
where each size is twice the previous size, such as 3, 6, 12, 24, ....
3. The contents of a bucket consist of:
a. The size of the bucket.
b. The timestamp of the bucket, that is, the most recent point that
contributes to the bucket.
c. A collection of records that represent the clusters into which the
points of that bucket have been partitioned. These records contain:
i. The number of points in the cluster.
ii. The centroid or clustroid of the cluster.
iii. Any other parameters necessary to enable us to merge clusters
and maintain approximations to the full set of parameters for
the merged cluster.
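
A minimal sketch of this bucket bookkeeping is shown below. The record layout and the cluster pairing in merge are deliberate simplifications (a real implementation would repeatedly merge the closest pair of clusters and carry extra per-cluster parameters):

    from dataclasses import dataclass, field

    @dataclass
    class Cluster:
        n: int                      # number of points in the cluster
        centroid: tuple             # centroid of the cluster

    @dataclass
    class Bucket:
        size: int                   # number of points the bucket represents
        timestamp: int              # most recent point contributing to it
        clusters: list = field(default_factory=list)

    def merge(b1: Bucket, b2: Bucket) -> Bucket:
        """Merge two equal-size buckets into one of twice the size."""
        assert b1.size == b2.size
        merged = Bucket(size=2 * b1.size,
                        timestamp=max(b1.timestamp, b2.timestamp))
        # Simplification: pair clusters positionally; a real implementation
        # would merge the closest pair of clusters repeatedly.
        for c1, c2 in zip(b1.clusters, b2.clusters):
            n = c1.n + c2.n
            centroid = tuple((c1.n * a + c2.n * b) / n
                             for a, b in zip(c1.centroid, c2.centroid))
            merged.clusters.append(Cluster(n, centroid))
        return merged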
