Week 02.1 Chaptr002

Data Mining:

Concepts and Techniques

— Chapter 2 —

Arslan Anjum

Quartile Deviation
 The inter-quartile range (Q) is a measure of spread similar to the
range. It is the difference between the third quartile (Q3) and the
first quartile (Q1):

Q = Q3 - Q1

 The quartiles are the values at the following positions in the
ordered data, where n is the number of observations:

Position of Q1 = (n+1)/4
Position of Q2 = 2(n+1)/4
Position of Q3 = 3(n+1)/4

 The inter-quartile range is frequently reduced to the
semi-interquartile range, known as the quartile deviation
(QD), by dividing it by 2:

QD = (Q3 - Q1)/2

Quartile Deviation
 Example: The wheat production (in kg) of 20 acres is given as:
1120, 1240, 1320, 1040, 1080, 1200, 1440, 1360, 1680,
1730, 1785, 1342, 1960, 1880, 1755, 1720, 1600, 1470,
1750, and 1885.
 Find the quartile deviation.

 Solution:
 After arranging the observations in ascending order, we
get
1040, 1080, 1120, 1200, 1240, 1320, 1342, 1360, 1440,
1470, 1600, 1680, 1720, 1730, 1750, 1755, 1785, 1880,
1885, 1960.
Quartile Deviation
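The remaining steps of the example can be sketched in Python. This is a minimal sketch, not the original solution: the slides give only the quartile positions k(n+1)/4, so resolving a fractional position by linear interpolation between the two neighbouring observations is an assumption, and the function name quartile is hypothetical.

```python
def quartile(sorted_values, k):
    """Return the k-th quartile (k = 1, 2, 3) using the position k(n+1)/4.

    Assumption: a fractional position is resolved by linear interpolation
    between the two neighbouring observations.
    """
    n = len(sorted_values)
    position = k * (n + 1) / 4        # 1-based position in the ordered data
    lower = int(position)             # observation just below the position
    fraction = position - lower
    lower = min(max(lower, 1), n)     # keep both indices inside the data
    upper = min(lower + 1, n)
    below = sorted_values[lower - 1]
    above = sorted_values[upper - 1]
    return below + fraction * (above - below)


production = [1120, 1240, 1320, 1040, 1080, 1200, 1440, 1360, 1680, 1730,
              1785, 1342, 1960, 1880, 1755, 1720, 1600, 1470, 1750, 1885]
ordered = sorted(production)

q1 = quartile(ordered, 1)             # position 5.25
q3 = quartile(ordered, 3)             # position 15.75
iqr = q3 - q1
qd = iqr / 2

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}, QD = {qd}")
```

Under these assumptions Q1 = 1260 and Q3 = 1753.75, so the inter-quartile range is 493.75 and the quartile deviation is 246.875 kg.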
Boxplot Analysis
 Five-number summary of a distribution
 Minimum, Q1, Median, Q3, Maximum
 Boxplot
 Data is represented with a box
 The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
 The median is marked by a line within the
box
 Whiskers: two lines outside the box extended
to Minimum and Maximum
 Outliers: points beyond a specified outlier
threshold, plotted individually
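The quantities a boxplot draws can be computed directly. The following Python sketch uses the same sample vector as the Matlab example in this deck. Two things are assumptions rather than part of the slides: the outlier threshold of 1.5 x IQR beyond the quartiles (a common convention), and the quartile rule, which here places quartile k at position k(n+1)/4 via the standard library's "exclusive" method. The function names are hypothetical.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    # method="exclusive" places quartile k at position k(n+1)/4
    q1, median, q3 = quantiles(ordered, n=4, method="exclusive")
    return ordered[0], q1, median, q3, ordered[-1]


def outliers(values, factor=1.5):
    """Values farther than factor * IQR from the box (factor is an assumption)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < low or v > high]


d = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
summary = five_number_summary(d)
flagged = outliers(d)

print("min, Q1, median, Q3, max:", summary)
print("outliers:", flagged)
```

Plotting tools may use a different quartile convention, so the box edges they draw can differ slightly from the values computed here.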

Boxplot in Matlab

>> d = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110];  % sample data
>> boxplot(d);  % draw the box plot of d

Visualization of Data Dispersion: 3-D Boxplots

May 2, 2024 Data Mining: Concepts and Techniques 7


Graphic Displays of Basic Statistical Descriptions

 Boxplot: graphic display of the five-number summary
 Histogram: the x-axis shows values, the y-axis represents frequencies
 Quantile plot: each value x_i is paired with f_i, indicating
that approximately 100 f_i % of the data are ≤ x_i
 Quantile-quantile (q-q) plot: graphs the quantiles of
one univariate distribution against the corresponding
quantiles of another
 Scatter plot: each pair of values is treated as a pair of coordinates
and plotted as a point in the plane

Histogram Analysis
 Histogram: graphical display of tabulated frequencies, shown as bars
 It shows what proportion of cases fall into each of several
categories
 The categories are usually specified as non-overlapping intervals of
some variable. The categories (bars) must be adjacent
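The tabulation behind a histogram can be sketched in a few lines of Python. This is an illustrative sketch only: the function name, the bin width of 20, and the starting value of 20 are assumptions, and the data is the sample vector from the Matlab boxplot example in this deck. Empty intervals are kept so that the bars stay adjacent.

```python
def histogram(values, start, bin_width, n_bins):
    """Count values in adjacent, non-overlapping intervals.

    Bin i covers [start + i * bin_width, start + (i + 1) * bin_width).
    Values outside all bins are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // bin_width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
counts = histogram(data, start=20, bin_width=20, n_bins=5)
proportions = [c / len(data) for c in counts]

# Print one text bar per interval, including the empty one
for i, count in enumerate(counts):
    lo = 20 + i * 20
    print(f"[{lo:3d}, {lo + 20:3d})  {'#' * count}  {count}")
```

The interval [80, 100) holds no values but still appears as an empty bar, which makes the gap before 110 visible.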

Quantile Plot
 Displays all of the data (allowing the user to assess both
the overall behavior and unusual occurrences)
 Plots quantile information
 For data x_i sorted in increasing order, f_i indicates that
approximately 100 f_i % of the data are below or equal to the value x_i
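The points of a quantile plot can be generated as follows. The slide does not say how f_i is computed; the sketch below assumes the common choice f_i = (i - 0.5)/N for the i-th smallest of N values, and the function name is hypothetical.

```python
def quantile_plot_points(data):
    """Return (f_i, x_i) pairs for a quantile plot.

    Assumption: f_i = (i - 0.5) / N for the i-th smallest of N values.
    """
    ordered = sorted(data)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


points = quantile_plot_points([30, 10, 20, 40])
for f, x in points:
    print(f"f = {f:.3f}  x = {x}")
```

Plotting x against f then shows every observation, with f on the horizontal axis running from just above 0 to just below 1.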



Quantile-Quantile (Q-Q) Plot
 Graphs the quantiles of one univariate distribution against the
corresponding quantiles of another
 View: Is there a shift in going from one distribution to another?
 Example shows unit price of items sold at Branch 1 vs. Branch 2 for
each quantile. Unit prices of items sold at Branch 1 tend to be lower
than those at Branch 2.
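A q-q plot pairs up matching quantiles of the two samples. The sketch below covers only the simplest case, assumed here for illustration: both samples have the same size, so the i-th smallest value of one is paired with the i-th smallest of the other. The unit prices are made-up numbers, not the data behind the slide's figure.

```python
def qq_points(sample_a, sample_b):
    """Pair the i-th smallest value of each sample (equal sizes assumed)."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch handles equal-sized samples only")
    return list(zip(sorted(sample_a), sorted(sample_b)))


# Hypothetical unit prices for two branches
branch1 = [40, 55, 62, 48, 75, 90]
branch2 = [45, 60, 70, 52, 88, 99]

points = qq_points(branch1, branch2)
lower_at_branch1 = sum(1 for a, b in points if a < b)

print(points)
print(f"{lower_at_branch1} of {len(points)} quantiles are lower at Branch 1")
```

With Branch 1 on the x-axis, points above the line y = x indicate that the same quantile is higher at Branch 2, which is the kind of shift the plot is meant to reveal.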

Scatter Plot
 Provides a first look at bivariate data to see clusters of
points, outliers, etc.
 Each pair of values is treated as a pair of coordinates and
plotted as a point in the plane

Positively and Negatively Correlated Data

 The left half of the figure is positively correlated
 The right half is negatively correlated

Uncorrelated Data
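The three cases shown in these scatter plots can also be told apart numerically. The slides do not name a measure; the sketch below assumes Pearson's correlation coefficient, and the small data sets are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]     # y grows with x
falling = [10, 8, 6, 4, 2]    # y shrinks as x grows
no_trend = [1, 3, 5, 3, 1]    # no linear trend in either direction

r_pos = pearson(x, rising)
r_neg = pearson(x, falling)
r_zero = pearson(x, no_trend)

print(f"{r_pos:+.2f} {r_neg:+.2f} {r_zero:+.2f}")
```

A coefficient near +1 or -1 matches the positively and negatively correlated panels, while a value near 0 means only that there is no linear relationship; the data may still show another pattern.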

Chapter 2: Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Data Visualization

 Measuring Data Similarity and Dissimilarity

 Summary

Data Visualization
 Why data visualization?
 Gain insight into an information space by mapping data onto graphical
primitives
 Provide qualitative overview of large data sets
 Search for patterns, trends, structure, irregularities, relationships among
data
 Help find interesting regions and suitable parameters for further
quantitative analysis
 Provide a visual proof of computer representations derived
 Categorization of visualization methods:
 Pixel-oriented visualization techniques
 Geometric projection visualization techniques
 Icon-based visualization techniques
 Hierarchical visualization techniques
 Visualizing complex data and relations
Pixel-Oriented Visualization Techniques
 For a data set of m dimensions,
 The m dimension values of a record are mapped to m pixels at the
corresponding positions in the windows
 The colors of the pixels reflect the corresponding values

[Figure: four pixel-oriented windows: (a) income, (b) credit limit, (c) transaction volume, (d) age]

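The mapping from records to pixel windows can be sketched as follows. This is a simplified illustration: each window is a flat list of gray levels rather than a 2-D image, the records are ordered by the first attribute (an assumption, so that the same position in every window belongs to the same record), and the customer records are invented.

```python
def pixel_windows(records, sort_by=0):
    """Build one window per dimension; each record contributes one pixel to each.

    Pixel values are gray levels 0..255 scaled to the attribute's range.
    Assumption: records are ordered by attribute `sort_by`, so position p
    in every window refers to the same record.
    """
    ordered = sorted(records, key=lambda r: r[sort_by])
    n_dims = len(ordered[0])
    windows = []
    for dim in range(n_dims):
        column = [r[dim] for r in ordered]
        lo, hi = min(column), max(column)
        span = (hi - lo) or 1          # avoid division by zero
        windows.append([255 * (v - lo) // span for v in column])
    return windows


# Hypothetical records: (income, credit limit, transaction volume, age)
customers = [(40, 3, 12, 30), (20, 1, 20, 60), (70, 9, 4, 45)]
windows = pixel_windows(customers)

for name, window in zip(["income", "credit", "volume", "age"], windows):
    print(f"{name:7s} {window}")
```

Reading the windows side by side shows, for example, that credit limit rises with income in this toy data while transaction volume falls.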
Geometric Projection Visualization Techniques

 Visualization of geometric transformations and projections of the data
 Methods
 Direct visualization
 Scatterplot and scatterplot matrices
 Landscapes
 Projection pursuit technique: helps users find meaningful
projections of multidimensional data
 Prosection views
 Parallel coordinates

Landscapes
[Figure: news articles visualized as a landscape. Used by permission of B. Wright, Visible Decisions Inc.]

 Visualization of the data as a perspective landscape
 The data needs to be transformed into a (possibly artificial) 2D
spatial representation which preserves the characteristics of the data

Parallel Coordinates
 n equidistant axes which are parallel to one of the screen axes and
correspond to the attributes
 The axes are scaled to the [minimum, maximum] range of the
corresponding attribute
 Every data item corresponds to a polygonal line which intersects each
of the axes at the point which corresponds to the value for the
attribute
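The vertices of one polygonal line can be computed as in the Python sketch below. The axis positions 0, 1, 2, ... and the scaling of every axis to the interval [0, 1] are illustrative choices, and the function name and data are hypothetical.

```python
def polyline(item, minimums, maximums):
    """Return the (axis index, scaled height) vertices for one data item.

    Each axis is scaled so that the attribute's minimum maps to 0.0 and its
    maximum to 1.0; an attribute with a single value is drawn at mid-height.
    """
    points = []
    for axis, (value, lo, hi) in enumerate(zip(item, minimums, maximums)):
        height = (value - lo) / (hi - lo) if hi > lo else 0.5
        points.append((axis, height))
    return points


items = [(1, 10, 100), (3, 30, 300), (2, 40, 100)]
minimums = tuple(min(column) for column in zip(*items))
maximums = tuple(max(column) for column in zip(*items))

line = polyline(items[2], minimums, maximums)
print(line)
```

Drawing one such line per data item, with the axes placed at equal distances, gives the parallel coordinates display.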

[Figure: parallel axes labeled Attr. 1, Attr. 2, Attr. 3, ..., Attr. k]

http://support.sas.com/documentation/
Icon-Based Visualization Techniques

 Visualization of the data values as features of icons


 Typical visualization methods
 Chernoff Faces
 Stick Figures
 General techniques
 Shape coding: Use shape to represent certain
information encoding
 Color icons: Use color icons to encode more
information
 Tile bars: Use small icons to represent the relevant
feature vectors in document retrieval
Chernoff Faces
 A way to display variables on a two-dimensional surface, e.g., let x be
eyebrow slant, y be eye size, z be nose length, etc.
 The figure shows faces produced using 10 characteristics (head
eccentricity, eye size, eye spacing, eye eccentricity, pupil size,
eyebrow slant, nose size, mouth shape, mouth size, and mouth
opening), each assigned one of 10 possible values, generated using
Mathematica (S. Dickson)
 References:
Gonick, L. and Smith, W. The Cartoon Guide to Statistics. New York:
Harper Perennial, p. 212, 1993.
Weisstein, Eric W. "Chernoff Face." From MathWorld--A Wolfram Web
Resource. mathworld.wolfram.com/ChernoffFace.html

Stick Figure
 A census data figure showing age, income, gender, education, etc.
 A 5-piece stick figure (1 body and 4 limbs with different angle/length)
 Two attributes are mapped to the axes; the remaining attributes are
mapped to the angle or length of the limbs
Used by permission of G. Grinstein, University of Massachusetts at Lowell

Hierarchical Visualization Techniques

 Visualization of the data using a hierarchical partitioning into subspaces
 Methods
 Worlds-within-Worlds
 Dimensional Stacking
 Tree-Map
 Cone Trees

Worlds-within-Worlds
 Fix all other parameters at constant values, then draw other worlds
(1-, 2-, or 3-dimensional) choosing these as the axes
 Software that uses this paradigm
 N–vision: Dynamic interaction through data, including rotation,
scaling (inner) and translation (inner/outer)
 Auto Visual

Dimensional Stacking
[Figure: dimensional stacking with axes labeled attribute 1, attribute 2, attribute 3, and attribute 4]

 Partitioning of the n-dimensional attribute space in 2-D
subspaces, which are ‘stacked’ into each other
 Partitioning of the attribute value ranges into classes. The
important attributes should be used on the outer levels.
 However, it is difficult to display more than nine dimensions
 Important to map dimensions appropriately

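For four attributes, the stacking of an inner 2-D grid inside each cell of an outer 2-D grid comes down to simple index arithmetic. The Python sketch below is an illustration under stated assumptions: attributes 1 and 2 span the outer grid and attributes 3 and 4 the inner grid, class indices are 0-based, and the class counts are hypothetical.

```python
def stacked_position(classes, bins):
    """Map four class indices to a (column, row) cell on the screen.

    classes = (c1, c2, c3, c4): 0-based class index of a record per attribute.
    bins    = (n1, n2, n3, n4): number of classes per attribute.
    Assumption: attributes 1 and 2 form the outer grid (x, y) and
    attributes 3 and 4 the inner grid nested inside each outer cell.
    """
    for c, n in zip(classes, bins):
        if not 0 <= c < n:
            raise ValueError("class index out of range")
    c1, c2, c3, c4 = classes
    _, _, n3, n4 = bins
    column = c1 * n3 + c3     # outer x cell, then offset inside it
    row = c2 * n4 + c4        # outer y cell, then offset inside it
    return column, row


bins = (2, 2, 3, 3)                        # hypothetical class counts
cell = stacked_position((1, 0, 2, 1), bins)
print(cell)
```

With these class counts the screen grid is 6 by 6 cells, and the record lands in the second outer column, first outer row, at inner offset (2, 1).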
Tree-Map
 Screen-filling method which uses a hierarchical partitioning of
the screen into regions depending on the attribute values
 The x- and y-dimension of the screen are partitioned alternately
according to the attribute values (classes)
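The alternating partitioning can be sketched as a short recursive layout, often called slice-and-dice. This is a minimal sketch under assumptions: the tree is a nested dictionary with hypothetical keys name, size, and children, and each region's share of the screen is proportional to the total size of its leaves.

```python
def size_of(node):
    """Total size of a node: its own size for a leaf, else the sum of its children."""
    children = node.get("children", [])
    return sum(size_of(c) for c in children) if children else node["size"]


def layout(node, x, y, width, height, split_x=True, rects=None):
    """Assign a rectangle (x, y, width, height) to every leaf.

    The split direction alternates between x and y at each level.
    """
    if rects is None:
        rects = {}
    children = node.get("children", [])
    if not children:
        rects[node["name"]] = (x, y, width, height)
        return rects
    total = size_of(node)
    offset = 0.0
    for child in children:
        share = size_of(child) / total
        if split_x:
            layout(child, x + offset * width, y, share * width, height, False, rects)
        else:
            layout(child, x, y + offset * height, width, share * height, True, rects)
        offset += share
    return rects


tree = {"name": "root", "children": [
    {"name": "A", "size": 2},
    {"name": "B", "children": [{"name": "B1", "size": 1},
                               {"name": "B2", "size": 1}]},
]}
rects = layout(tree, 0, 0, 8, 4)
for name, rect in rects.items():
    print(name, rect)
```

On an 8 by 4 screen, A fills the left half and B1 and B2 split the right half into an upper and a lower region.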

https://support.office.com/

InfoCube
 A 3-D visualization technique where hierarchical
information is displayed as nested semi-transparent
cubes
 The outermost cubes correspond to the top level
data, while the subnodes or the lower level data
are represented as smaller cubes inside the
outermost cubes, and so on
