Data Distribution
Modeling
B.Tech VIth Semester
Contents: Unit 2 Lecture 1
Data Objects and Attribute Types
Basic Statistical Description of Data
Data Visualization
Measuring Data Similarity and Dissimilarity
Summary
Record
Relational records
Data matrix, e.g., numerical matrix, crosstabs
Document data: text documents represented as term-frequency vectors
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
Transaction data
TID  Items
1    Bread, Coke, Milk
Graph and network
World Wide Web
Social or information networks
Molecular structures
Ordered
Video data: sequence of images
Temporal data: time-series
Sequential data: transaction sequences
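The term-frequency representation of document data can be sketched in a few lines of Python. This is a minimal illustration only: the vocabulary, the sample sentences, and the function name term_frequency_vector are hypothetical, and splitting on whitespace is a simplifying assumption (real text usually needs proper tokenization).

```python
from collections import Counter

def term_frequency_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())  # naive whitespace tokenization
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and documents, for illustration only.
vocabulary = ["team", "coach", "play", "ball"]
docs = [
    "team play team ball play",
    "coach coach ball",
]

vectors = [term_frequency_vector(d, vocabulary) for d in docs]
for name, vec in zip(["Document 1", "Document 2"], vectors):
    print(name, vec)
```

Each document becomes one row of a data matrix whose columns are the vocabulary terms.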
Data Object
Attribute types:
Nominal
Binary
Ordinal
Numeric: quantitative
Interval-scaled
Ratio-scaled
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome (e.g., HIV positive)
Ordinal
Values have a meaningful order (ranking) but magnitude between
successive values is not known.
Size = {small, medium, large}, grades, army rankings
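A small sketch of how an ordinal attribute might be handled in code, assuming the Size example above. The names SIZE_ORDER, RANK, ordinal_sort and ordinal_median are hypothetical. The rank numbers are used only for ordering, never for arithmetic, because the magnitude between successive values is not known.

```python
# Explicit ordering of the ordinal values (assumed for illustration).
SIZE_ORDER = ["small", "medium", "large"]
RANK = {label: i for i, label in enumerate(SIZE_ORDER)}

def ordinal_sort(values):
    """Sort ordinal labels by their rank, not alphabetically."""
    return sorted(values, key=lambda v: RANK[v])

def ordinal_median(values):
    """Return the middle label (lower middle for an even count)."""
    ordered = ordinal_sort(values)
    return ordered[(len(ordered) - 1) // 2]

sizes = ["large", "small", "medium", "small", "large"]
print(ordinal_sort(sizes))
print(ordinal_median(sizes))
```

Order-based summaries such as the median are meaningful here; a mean of the rank numbers would not be.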
Prof. Moumita Pal 6
Numeric Attribute Types
Quantity (integer or real-valued)
Interval
Measured on a scale of equal-sized units
Values have order
E.g., temperature in °C or °F, calendar dates
No true zero-point
Ratio
Inherent zero-point
We can speak of one value as being a multiple of another
(10 K is twice as high as 5 K).
E.g., temperature in Kelvin, length, counts,
monetary quantities
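The difference between interval and ratio scales can be checked numerically. The sketch below assumes the standard conversion K = °C + 273.15; the function and variable names are illustrative only.

```python
def celsius_to_kelvin(c):
    """Convert a Celsius temperature to Kelvin (ratio scale)."""
    return c + 273.15

c_hi, c_lo = 10.0, 5.0

# On the interval scale (Celsius) the quotient is not meaningful,
# because 0 °C is not a true zero point.
naive_ratio = c_hi / c_lo

# On the ratio scale (Kelvin) the quotient is meaningful.
true_ratio = celsius_to_kelvin(c_hi) / celsius_to_kelvin(c_lo)

# Differences, however, are preserved on both scales.
diff_c = c_hi - c_lo
diff_k = celsius_to_kelvin(c_hi) - celsius_to_kelvin(c_lo)

print(naive_ratio, round(true_ratio, 4), diff_c, round(diff_k, 4))
```

So 10 °C is not "twice as warm" as 5 °C, although the 5-degree difference is the same on both scales.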
Discrete vs. Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
E.g., zip codes, profession, or the set of words in a collection
of documents
Sometimes, represented as integer variables
Note: Binary attributes are a special case of discrete attributes
Continuous Attribute
Has real numbers as attribute values
E.g., temperature, height, or weight
Mode:
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula (unimodal, moderately skewed data): mean − mode ≈ 3 × (mean − median)
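The empirical relation commonly quoted for unimodal, moderately skewed data, mean − mode ≈ 3 × (mean − median), can be tried out with Python's statistics module. This is only a sketch: the sample data are hypothetical and were chosen so that the relation holds exactly, whereas on real data it is a rough approximation. The function name empirical_mode is an assumption.

```python
from statistics import mean, median, multimode

def empirical_mode(values):
    """Estimate the mode from mean and median: mode ~ mean - 3*(mean - median)."""
    m = mean(values)
    return m - 3 * (m - median(values))

# Hypothetical, positively skewed sample.
data = [1, 2, 2, 2, 3, 3, 4, 5, 6, 7]

modes = multimode(data)          # all most-frequent values; one entry means unimodal
estimate = empirical_mode(data)

print("mean:", mean(data), "median:", median(data))
print("observed mode(s):", modes, "empirical estimate:", estimate)
```

multimode returns a list, so a result with two or three entries indicates bimodal or trimodal data.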
Symmetric vs Skewed Data
Median, mean and mode of symmetric, positively and negatively skewed
data
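A rough way to see the direction of skew is to compare the mean with the median: the mean is pulled toward the long tail. The sketch below is a heuristic under that assumption, not a formal skewness test; the function name skew_direction and the sample lists are hypothetical.

```python
from statistics import mean, median

def skew_direction(values, tol=1e-9):
    """Heuristic: compare mean and median to guess the direction of skew."""
    m, med = mean(values), median(values)
    if abs(m - med) <= tol:
        return "roughly symmetric"
    return "positively skewed" if m > med else "negatively skewed"

print(skew_direction([1, 2, 3, 4, 5]))    # mean equals median
print(skew_direction([1, 2, 2, 3, 10]))   # long tail to the right
print(skew_direction([1, 8, 9, 9, 10]))   # long tail to the left
```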
Histograms Often Tell More than Boxplots
Two different histograms may have the same boxplot representation
The same values for: min, Q1, median, Q3, max
But they have rather different data distributions
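The point can be illustrated with two small hypothetical data sets that share the same five-number summary but have different histograms. The quartile rule used here (median of the lower and upper halves, excluding the overall median when the count is odd) is one common convention and an assumption of this sketch; other tools may compute Q1 and Q3 slightly differently.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower = s[:half]
    upper = s[half + 1:] if n % 2 else s[half:]
    return (s[0], median(lower), median(s), median(upper), s[-1])

def bin_counts(values, width, n_bins):
    """Count values per equal-width bin [k*width, (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        counts[int(v // width)] += 1
    return counts

# Hypothetical data: a is evenly spread, b is clustered around 2.5 and 7.5.
a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
b = [1, 2.5, 2.5, 2.5, 5, 7.5, 7.5, 7.5, 9]

print(five_number_summary(a), five_number_summary(b))
print(bin_counts(a, 2, 5), bin_counts(b, 2, 5))
```

Both data sets would produce the same boxplot, while the bin counts reveal the difference in shape.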
Quantile Plot
Displays all of the data (allowing the user to assess both the
overall behavior and unusual occurrences)
Plots quantile information
For data x_i sorted in increasing order, f_i indicates that
approximately 100 f_i% of the data are below or equal to the
value x_i
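The points of a quantile plot can be computed as in the sketch below. The slide does not give a formula for f_i; the choice f_i = (i − 0.5) / N, for i = 1..N, is one commonly used convention and is an assumption here. The function name quantile_points and the sample values are hypothetical.

```python
def quantile_points(values):
    """Return (f_i, x_i) pairs for a quantile plot.

    Assumes f_i = (i - 0.5) / N for the i-th smallest value (i = 1..N).
    """
    s = sorted(values)
    n = len(s)
    return [((i - 0.5) / n, x) for i, x in enumerate(s, start=1)]

points = quantile_points([40, 10, 30, 20])
for f, x in points:
    print(f"{100 * f:.1f}% of the data are at or below {x}")
```

Plotting x_i against f_i (for example with a plotting library of your choice) gives the quantile plot.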