02Data (2)
Techniques
— Chapter 2 —
Summary
Types of Data Sets
Record
    Relational records
    Data matrix, e.g., numerical matrix, crosstabs
    Document data: text documents: term-frequency vector
    Transaction data
Graph and network
    World Wide Web
    Social or information networks
    Molecular structures
Ordered
    Video data: sequence of images
    Temporal data: time-series
    Sequential data: transaction sequences
    Genetic sequence data
Spatial, image and multimedia
    Spatial data: maps
    Image data
    Video data

Example of document data (term-frequency vectors):

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

Example of transaction data:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
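
As a rough illustration of the term-frequency vector mentioned above, the sketch below counts how often each vocabulary term occurs in a document. The vocabulary, the sample sentence, and the helper name term_frequency_vector are assumptions made for this example; they do not come from the slides.

```python
from collections import Counter


def term_frequency_vector(text, vocabulary):
    """Return one count per vocabulary term, in vocabulary order."""
    # Naive tokenization (an assumption): lowercase, split on whitespace.
    # Real pipelines usually also strip punctuation and apply stemming.
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and document, chosen only for illustration.
vocabulary = ["team", "coach", "play", "ball", "score", "game"]
document = "the coach said the team will play the game as a team"

vector = term_frequency_vector(document, vocabulary)
print(vector)
```

Each document becomes one row of a data matrix whose columns are the vocabulary terms, which is how document data fits the record model.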
Important Characteristics of Structured Data
Dimensionality
    Curse of dimensionality
Sparsity
    Only presence counts
Resolution
    Patterns depend on the scale
Distribution
    Centrality and dispersion
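
To make the sparsity point concrete, the following sketch measures the fraction of zero entries in the three document vectors shown earlier in this chapter and derives a presence-only (binary) view of them. The variable names and the dictionary-based sparse layout are illustrative assumptions, not something the slides prescribe.

```python
# The three term-frequency rows from the document-data example.
documents = [
    [3, 0, 5, 0, 2, 6, 0, 2, 0, 2],
    [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],
    [0, 1, 0, 0, 1, 2, 2, 0, 3, 0],
]

total_cells = sum(len(row) for row in documents)
zero_cells = sum(value == 0 for row in documents for value in row)

# Sparsity here means the fraction of entries that are zero.
sparsity = zero_cells / total_cells

# "Only presence counts": keep 1 where a term occurs, 0 otherwise.
presence = [[int(value > 0) for value in row] for row in documents]

# One possible sparse layout: store only the non-zero entries per row.
sparse_rows = [
    {index: value for index, value in enumerate(row) if value != 0}
    for row in documents
]

print(sparsity)
print(sparse_rows[1])
```

Storing only the non-zero entries pays off as the number of columns grows, because most terms are absent from most documents.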
Data Objects
Types:
Nominal
Binary
Numeric: quantitative
Interval-scaled
Ratio-scaled
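Attribute Types
Nominal:
Nominal means “relating to names.” The values of a
nominal attribute are symbols or names of things
Discrete Attribute
Has only a finite or countably infinite set of values
E.g., zip codes, profession, or the set of words
in a collection of documents
Sometimes, represented as integer variables
Note: binary attributes are a special case of
discrete attributes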
Continuous Attribute
Has real numbers as attribute values
E.g., temperature, height, or weight
Practically, real values can only be measured and
represented using a finite number of digits
Continuous attributes are typically represented
as floating-point variables
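
A small sketch of what floating-point representation means in practice: because only finitely many digits are stored, arithmetic on continuous values can be slightly inexact, so comparisons are usually made with a tolerance. The specific numbers are just a common illustration.

```python
import math

total = 0.1 + 0.2

# Exact comparison fails because 0.1 and 0.2 have no exact binary form.
exactly_equal = (total == 0.3)

# Comparing with a tolerance is the usual remedy for continuous values.
close_enough = math.isclose(total, 0.3)

print(total, exactly_equal, close_enough)
```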
Chapter 2: Getting to Know Your Data
Basic Statistical Descriptions of Data
Motivation
To better understand the data: central
tendency, variation and spread
Central tendency: A measure of central tendency (also
referred to as a measure of centre or central location) is a
summary measure that attempts to describe a whole set of
data with a single value that represents the middle or centre
of its distribution.
Variation of the data: Measures of variation are statistics that
describe how far the values in the observations (data points)
are from each other. Measures of variation include the range,
quartiles and percentiles, interquartile range, and standard
deviation.
Spread: Measures of spread describe how similar or varied
the set of observed values is for a particular variable (data
item). Measures of spread include the range, quartiles and
the interquartile range, variance and standard deviation.
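
The measures named above can be computed with Python's standard statistics module. This is a minimal sketch on made-up values (not data from the slides), and it treats the values as a complete population, hence pvariance and pstdev rather than their sample counterparts.

```python
import statistics

# Hypothetical observations, chosen so the results are round numbers.
values = [2, 4, 4, 4, 5, 5, 7, 9]

center = statistics.mean(values)            # central tendency
value_range = max(values) - min(values)     # range
variance = statistics.pvariance(values)     # population variance
std_dev = statistics.pstdev(values)         # population standard deviation

print(center, value_range, variance, std_dev)
```

If the values were a sample drawn from a larger population, statistics.variance and statistics.stdev (which divide by n - 1) would be the usual choice.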
Data dispersion characteristics: median, max, min,
quantiles, outliers, variance, etc.
A good measure of dispersion should be based on all the
observations of the series, be rigidly defined, and not
be affected by extreme values.
Numerical dimensions correspond to sorted
intervals
Data dispersion: analyzed with multiple
granularities of precision
Boxplot or quantile analysis on sorted intervals
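
A boxplot is drawn from the five-number summary of the sorted values. The sketch below computes that summary and flags outliers with the common 1.5 x IQR rule; the rule, the illustrative values (in thousands of dollars), and the variable names are assumptions for this example. Note that statistics.quantiles uses one particular quartile convention by default, and other conventions give slightly different quartiles.

```python
import statistics

# Illustrative values only; they do not appear in the slides.
values = sorted([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

# Quartiles under the default ("exclusive") method of statistics.quantiles.
q1, median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Five-number summary: min, Q1, median, Q3, max.
five_number_summary = (values[0], q1, median, q3, values[-1])

# Common convention: points beyond 1.5 * IQR from the quartiles are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(five_number_summary)
print(outliers)
```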
Population: A population is the entire group that
you want to draw conclusions about.
Sample: A sample is a smaller and more manageable subset of the population.
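
The population/sample distinction can be sketched as follows: a synthetic population is generated, a random sample is drawn from it, and the sample mean is used as an estimate of the population mean. The population (simulated heights), the sample size, and the seed are all assumptions made for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 simulated heights in centimetres.
population = [rng.gauss(170, 10) for _ in range(10_000)]

# A sample is a subset drawn from the population, here without replacement.
sample = rng.sample(population, 100)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(round(population_mean, 2), round(sample_mean, 2))
```

The two means will generally differ a little; the gap tends to shrink as the sample grows.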
Measuring the Central Tendency
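Mean (algebraic measure) (sample vs. population):
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$$
Note: n is sample size and N is population size.
Weighted arithmetic mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
Trimmed mean: the mean obtained after chopping off extreme values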
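
The three means above can be sketched in a few lines of Python. The function names, the trimming convention (drop the same proportion from each end of the sorted data), and the illustrative values are assumptions; the values are in thousands of dollars and were chosen to be consistent with the median and mode examples in this section, but they do not appear in the slides.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    k = int(len(values) * proportion)
    if 2 * k >= len(values):
        raise ValueError("proportion too large: nothing left to average")
    kept = sorted(values)[k:len(values) - k]
    return sum(kept) / len(kept)


# Illustrative values only (thousands of dollars).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

plain = mean(salaries)
trimmed = trimmed_mean(salaries, 0.1)       # drops 30 and 110
weighted = weighted_mean([80, 90], [1, 3])  # second value counts 3 times

print(plain, trimmed, weighted)
```

Trimming removes the single high value of 110, which pulls the plain mean upward, so the trimmed mean sits closer to the bulk of the data.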
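Median: the middle value of the sorted data; with an even
number of values, the average of the two middle values,
e.g., (52 + 56)/2 = 108/2 = 54 (in thousands of dollars)
Mode: the value that occurs most frequently; in the same
example, the two modes are $52,000 and $70,000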
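
The median and mode figures above can be reproduced with the standard statistics module. The salary list is an assumption: the slides give only the results, so the values below (in thousands of dollars) were chosen to be consistent with them.

```python
import statistics

# Assumed values, consistent with the median and modes quoted above.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Even number of values: the median averages the two middle ones (52, 56).
median_salary = statistics.median(salaries)

# multimode returns every value tied for the highest frequency.
modes = statistics.multimode(salaries)

print(median_salary, modes)
```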
Symmetric vs. Skewed Data
Median, mean and mode of symmetric, positively
and negatively skewed data
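
As a rough sketch of how the three measures relate under skew, the example below compares them on small made-up data sets. The pattern shown (mean pulled toward the long tail) is typical rather than guaranteed, and the data and function name are assumptions for illustration.

```python
import statistics


def center_measures(values):
    """Return (mean, median, mode) for data with a single mode."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.mode(values),
    )


# Hypothetical data sets.
symmetric = [1, 2, 3, 3, 3, 4, 5]
right_skewed = [1, 2, 2, 2, 3, 4, 14]     # long tail of high values
left_skewed = [1, 11, 12, 13, 13, 13, 14]  # long tail of low values

sym = center_measures(symmetric)       # all three coincide
pos = center_measures(right_skewed)    # mean above median and mode
neg = center_measures(left_skewed)     # mean below median and mode

print(sym, pos, neg)
```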
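Variance (population):
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$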
The categories are usually
specified as non-overlapping
intervals of some variable
Histogram Analysis
The range of values is partitioned into disjoint consecutive
subranges. The subranges, referred to as buckets or bins, are
disjoint subsets of the data distribution.
The range of a bucket is known as the width.
Buckets (or bins) are defined by equal-width ranges
A histogram
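
The equal-width bucketing described above can be sketched as follows. The function name, the choice to include the maximum value in the last bucket, and the sample values are assumptions made for this illustration.

```python
def equal_width_histogram(values, num_bins):
    """Partition [min, max] into num_bins equal-width buckets and count.

    Returns a list of (low, high, count) tuples. Every bucket is
    half-open, [low, high), except the last, which also includes max.
    """
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for v in values:
        if width == 0:
            index = 0  # all values identical: everything in one bucket
        else:
            # Clamp so that the maximum value lands in the last bucket.
            index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    return [
        (low + i * width, low + (i + 1) * width, counts[i])
        for i in range(num_bins)
    ]


# Hypothetical data.
values = [1, 2, 2, 3, 5, 7, 8, 8, 9, 10]
buckets = equal_width_histogram(values, 3)

for bucket_low, bucket_high, count in buckets:
    print(f"[{bucket_low}, {bucket_high}): {count}")
```

Here the bucket width is (10 - 1) / 3 = 3, so the buckets cover 1 to 4, 4 to 7, and 7 to 10.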
Histograms Often Tell More than Boxplots