University of Gondar
College of Informatics
Department of Information Systems
Data and Business Analytics
InSy4145
Ibrahim Gashaw (PhD)
Chapter Five
Descriptive Statistics and Exploratory Data Analysis
Outline
What are descriptive statistics and Exploratory
Data Analysis (EDA)?
Basic numerical summaries of data
Basic graphical summaries of data
“Central Dogma” of Statistics
[Figure: diagram relating the Population and the Sample, with arrows labeled Probability, Descriptive Statistics, and Inferential Statistics]
Types of Data
Categorical
– binary: 2 categories
– nominal: more categories
– ordinal: order matters
Quantitative
– discrete: numerical
– continuous: uninterrupted
What are descriptive statistics?
Descriptive statistics help to describe and
understand the features of a specific data set by giving
short summaries of the sample and measures of
the data.
In quantitative research, after collecting data, the first
step of statistical analysis is to describe characteristics
of the responses, such as the average of one variable
(e.g., age), or the relation between two variables (e.g.,
age and creativity).
The next step is inferential statistics, which helps you
decide whether your data confirm or refute your
hypothesis and whether the result is generalizable to a larger
population.
Types of descriptive statistics
There are 3 main types of descriptive
statistics:
The distribution concerns the frequency of
each value.
The central tendency concerns the averages
of the values.
The variability or dispersion concerns how
spread out the values are.
Numerical Summaries of Data
• Central Tendency measures. They are
computed to give a “center” around which the
measurements in the data are distributed.
• Variation or Variability measures. They
describe “data spread” or how far away the
measurements are from the center.
• Relative Standing measures. They describe
the relative position of specific measurements in the
data.
Mode
The mode is simply the most popular or most frequent
response value. A data set can have no mode, one mode, or
more than one mode.
To find the mode, order your data set from lowest to highest
and find the response that occurs most frequently.
E.g., mode of the number of library visits
Ordered data set: 0, 3, 3, 12, 15, 24
Mode (the most frequently occurring response): 3
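The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the helper name find_modes is hypothetical, and it assumes the convention that a data set in which no value repeats has no mode.

```python
from collections import Counter

def find_modes(values):
    """Return the most frequent value(s); an empty list means 'no mode'."""
    if not values:
        return []
    counts = Counter(values)            # frequency of each distinct value
    top = max(counts.values())          # highest frequency observed
    if top == 1 and len(counts) > 1:    # assumed convention: nothing repeats, so no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)

visits = [0, 3, 3, 12, 15, 24]          # library visits from the slide
print(find_modes(visits))               # one mode
print(find_modes([1, 1, 2, 2, 5]))      # more than one mode
print(find_modes([1, 2, 3]))            # no mode
```

The three calls cover the three cases named on the slide: one mode, more than one mode, and no mode.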
Why Squared Deviations?
• Adding deviations will yield a sum of ?
• Absolute values do not have nice mathematical
properties
• Squares eliminate the negatives
• Result:
– Increasing contribution to the variance as you go
farther from the mean.
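The first bullet can be checked with one line of algebra. The sketch below assumes a sample of n values x_1, ..., x_n with mean \bar{x}. The n - 1 divisor in the variance is the usual sample convention and is an assumption here, since the slides do not state which divisor they use.

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0,
\qquad
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
```

Because the raw deviations always cancel, they carry no information about spread. Squaring makes every term non-negative and weights distant points more heavily.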
Scale: Standard Deviation
Variance is somewhat arbitrary
• What does it mean to have a variance of 10.8? Or
2.2? Or 1459.092? Or 0.000001?
• On its own, very little, because variance is expressed in squared units. But if you could “standardize” that value,
you could talk about any variance (i.e. deviation) in
equivalent terms
• The standard deviation is simply the square root of
the variance
Scale: Standard Deviation...
7. Square root – now the value is in the units we started with!!!
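As a rough illustration of how the computation reaches that final square root, here is a minimal Python sketch. The earlier steps of the slide are not reproduced in the text, so the breakdown below is an assumption. It also assumes the sample (n - 1) divisor and reuses the library-visit values from the mode example.

```python
import math
import statistics

def standard_deviation_steps(values):
    """Compute mean, sample variance, and standard deviation step by step."""
    n = len(values)
    mean = sum(values) / n                      # center of the data
    deviations = [x - mean for x in values]     # signed distances (these sum to 0)
    squared = [d ** 2 for d in deviations]      # squaring removes the negatives
    variance = sum(squared) / (n - 1)           # assumed: sample variance, squared units
    sd = math.sqrt(variance)                    # back in the original units
    return mean, variance, sd

visits = [0, 3, 3, 12, 15, 24]
mean, variance, sd = standard_deviation_steps(visits)
print(mean, variance, round(sd, 4))

# Cross-check against the standard library's sample standard deviation.
print(math.isclose(sd, statistics.stdev(visits)))
```

The variance here is in "visits squared", which is hard to interpret. The standard deviation is in visits, the unit the data started with.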
Percentiles (Quantiles)
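A percentile is the value below which a given percentage of the data falls; the quartiles are the 25th, 50th, and 75th percentiles. The sketch below is illustrative only. It uses Python's statistics.quantiles with method="inclusive", which is one of several interpolation rules, so other software may report slightly different values for the same data.

```python
from statistics import median, quantiles

visits = [0, 3, 3, 12, 15, 24]   # library visits, reused for illustration

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = quantiles(visits, n=4, method="inclusive")
print(q1, q2, q3)                # 3.0 7.5 14.25

# The second quartile (50th percentile) is the median.
print(q2 == median(visits))
```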
Exploratory Data Analysis
[Figure: workflow stages: Get Data; Exploratory Data Analysis; Preprocessing; Predictive and Descriptive modeling]
Techniques Used In Data Exploration
In EDA
– The focus is on visualization
– Clustering and anomaly detection can be viewed as
exploratory techniques
– In data mining, clustering and anomaly detection are
major areas of interest, and not thought of as just
exploratory
In our discussion of data exploration, we focus
on
1. Summary statistics
2. Visualization
Iris Sample Data Set
Many of the exploratory data techniques are illustrated
with the Iris Plant data set.
– Can be obtained from the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
– From the statistician Ronald A. Fisher
– Three flower types (classes):
Setosa
Virginica
Versicolour
– Four (non-class) attributes
Sepal width and length
Petal width and length
[Image: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.]
Visualization
Visualization is the conversion of data into a visual
or tabular format so that the characteristics of the
data and the relationships among data items or
attributes can be analyzed or reported.
Visualization of data is one of the most powerful
and appealing techniques for data exploration.
– Humans have a well developed ability to analyze large
amounts of information that is presented visually
– Can detect general patterns and trends
– Can detect outliers and unusual patterns
Example: Sea Surface Temperature
The following shows the Sea Surface Temperature
(SST) for July 1982
– Tens of thousands of data points are summarized in a
single figure
Representation
Representation is the mapping of information to a visual format
Data objects, their attributes, and the relationships
among data objects are translated into graphical
elements such as points, lines, shapes, and
colors.
Example:
– Objects are often represented as points
– Their attribute values can be represented as the
position of the points or the characteristics of the
points, e.g., color, size, and shape
– If position is used, then the relationships of points, i.e.,
whether they form groups or a point is an outlier, are
easily perceived.
Example: Visualizing Universities
Visualization Techniques: Histograms
Histogram
– Usually shows the distribution of values of a single variable
– Divide the values into bins and show a bar plot of the number of
objects in each bin.
– The height of each bar indicates the number of objects
– Shape of histogram depends on the number of bins
Example: Petal Width (10 and 20 bins, respectively)
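The effect of the number of bins can be seen without any plotting library. The following is a minimal sketch of equal-width binning. The function name histogram_counts and the petal-width values are made up for illustration and are not the actual Iris measurements.

```python
def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    if width == 0:                       # all values identical: one occupied bin
        counts[0] = len(values)
        return counts
    for v in values:
        # The maximum value would land in bin index `bins`; clip it into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical petal widths in cm (illustrative values only).
petal_widths = [0.2, 0.2, 0.3, 0.4, 1.0, 1.3, 1.4, 1.5, 1.8, 2.0, 2.3, 2.5]

for bins in (3, 6):
    counts = histogram_counts(petal_widths, bins)
    print(f"{bins} bins:", counts)
    for c in counts:
        print("  " + "#" * c)            # text bar: height = number of objects
```

With fewer bins the distribution looks smoother; with more bins, gaps and separate groups in the data become visible.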
Two-Dimensional Histograms
Show the joint distribution of the values of two
attributes
Example: petal width and petal length
– What does this tell us?
Visualization Techniques: Histograms
Several variations of histograms exist: equal-width bins (most
popular); other approaches use variable bin sizes…
Choosing proper bin sizes and bin starting points is a
non-trivial problem!!
Visualization Techniques: Box Plots
Box Plots (we do not use the version depicted below!)
– Invented by J. Tukey
– Another way of displaying the distribution of data
– Following figure shows the basic part of a box plot
[Figure: basic box plot, labeled from top to bottom: outlier, 90th percentile, 75th percentile, 50th percentile, 25th percentile, 10th percentile]
Also see: https://en.wikipedia.org/wiki/Box_plot
Boxplots in R (we use those!!)
By default, boxplot() in R draws the whiskers at the maximum and the minimum
non-outlying values instead of the 10th and 90th percentiles as the
book describes. Outliers in box plots are values that lie more than 1.5*IQR
away from the box, where IQR is the height of the box!
See:
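> a <- c(11, 12, 22, 33, 34, 100)
> boxplot(a)  # 100 lies more than 1.5*IQR above the box and is drawn as an outlier
> b <- c(11, 12, 22, 33, 34, 65)
> boxplot(b)  # 65 is within 1.5*IQR of the box, so no outlier is drawn
> a <- c(11, 12, 22, 33, 34, 50, 100)
> boxplot(a)
[Figure: R box plot, labeled from top to bottom: outlier; maximum non-outlier (in place of the 90th percentile); 75th percentile; 50th percentile; 25th percentile; minimum non-outlier (in place of the 10th percentile). The height of the box is the IQR, a proxy for standard deviation.]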
Example of R Box Plots (Mid1 Question)
b) The following boxplot has been created using the following R-code for an attribute x:
> x<-c(1,2,2,2,4,4,8,9,9,10,18,22)
> boxplot(x)
R version 3.4.3: Kite-Eating Tree
What is the median for the attribute x? What is the IQR for the attribute x? The lower whisker of the boxplot is at
1; what does this tell you? According to the boxplot, 18 is not an outlier and 22 is an outlier; why do you believe this
is the case?
Median is 6=(4+8)/2
IQR=9.5-2=7.5
1 is the lowest value in the dataset that is not an outlier. Every value that lies more than 1.5*IQR above the 75th percentile or
more than 1.5*IQR below the 25th percentile is an outlier; that is, for this particular boxplot, values above 9.5+1.5*7.5=20.75 and values below 2-1.5*7.5=-9.25
are outliers. Consequently, 22 is an outlier and 1 and 18 are not, and the whiskers are therefore at 1 and 18!
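The numbers in this answer can be reproduced with a short Python sketch. It assumes the box edges are Tukey hinges (the medians of the lower and upper halves of the sorted data), which matches the values 2 and 9.5 used above. The function name tukey_box_stats is hypothetical.

```python
from statistics import median

def tukey_box_stats(values):
    """Median, hinges, IQR, fences, whiskers, and outliers for a box plot."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2                       # each half includes the median when n is odd
    lower_hinge = median(data[:half])         # box bottom (about the 25th percentile)
    upper_hinge = median(data[n - half:])     # box top (about the 75th percentile)
    iqr = upper_hinge - lower_hinge
    lower_fence = lower_hinge - 1.5 * iqr
    upper_fence = upper_hinge + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "median": median(data),
        "lower_hinge": lower_hinge,
        "upper_hinge": upper_hinge,
        "iqr": iqr,
        "fences": (lower_fence, upper_fence),
        "whiskers": (min(inside), max(inside)),   # most extreme non-outlying values
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }

x = [1, 2, 2, 2, 4, 4, 8, 9, 9, 10, 18, 22]
stats = tukey_box_stats(x)
for name, value in stats.items():
    print(name, value)
```

For x this gives a median of 6, an IQR of 7.5, fences at -9.25 and 20.75, whiskers at 1 and 18, and 22 as the only outlier.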
Attribute Standardization: Z-scores
Attribute Standardization/Normalization makes attributes
equally important and alleviates the impact of attribute scale
Z-score standardization:
– Calculate the mean $m_f$ and the standard deviation $s_f$ of attribute $f$
– Calculate the standardized measurement (z-score):
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
Result of Z-score standardization is a dataset in which each
attribute has a mean of 0 and a standard deviation of 1.
The obtained attribute values allow for statistical
interpretation: e.g. if a person’s z-scored age is -1, her age is
one standard deviation below the average age…
Z-scores can be interpreted based on the 68-95-99.7 Rule!
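A minimal Python sketch of z-score standardization follows. The ages are hypothetical, and the sketch assumes the sample standard deviation (n - 1 divisor), since the slides do not say which divisor is intended.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize values: subtract the mean, divide by the standard deviation."""
    m = mean(values)
    s = stdev(values)        # assumed: sample standard deviation; fails if all values are equal
    return [(x - m) / s for x in values]

ages = [22, 25, 31, 38, 44]  # hypothetical ages, for illustration only
z = z_scores(ages)

for age, score in zip(ages, z):
    print(age, round(score, 2))

# After standardization the attribute has mean 0 and standard deviation 1.
print(round(mean(z), 10), round(stdev(z), 10))
```

A negative z-score marks a value below the mean and a positive one a value above it, measured in standard deviations.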
[0,1] Attribute Standardization
Approach: Normalize interval-scaled variables using
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
where min(x) denotes the minimum value and max(x) denotes the
maximum value of attribute x in the data set; that is, all values
of the normalized dataset are numbers in [0,1].
Question: if the normalized value of an attribute is 0; what does this
mean?
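The [0,1] approach can be sketched as follows. This is a minimal illustration with hypothetical ages; the function name min_max_normalize is not from the slides.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; the range is zero")
    return [(x - lo) / (hi - lo) for x in values]

ages = [22, 25, 31, 38, 44]      # hypothetical ages, for illustration only
scaled = min_max_normalize(ages)

for age, s in zip(ages, scaled):
    print(age, round(s, 3))
```

Printing the pairs side by side shows which original value each normalized value corresponds to, which is a useful check when answering the question above.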
Visualization Techniques: Scatter Plots
Scatter plots
– Attribute values determine the position
– Two-dimensional scatter plots most common, but can
have three-dimensional scatter plots
– Often additional attributes can be displayed by using
the size, shape, and color of the markers that
represent the objects
– Arrays of scatter plots are useful because they can
compactly summarize the relationships of several pairs
of attributes
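The relationship that a two-dimensional scatter plot displays can also be summarized numerically with the Pearson correlation coefficient. The sketch below is illustrative only: the function pearson_r is written by hand, and the petal measurements are made-up values rather than actual Iris data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Sum of products of paired deviations (numerator of the covariance).
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical petal lengths and widths in cm (illustrative values only).
petal_length = [1.4, 1.5, 4.5, 4.7, 5.1, 5.9]
petal_width = [0.2, 0.3, 1.5, 1.4, 1.9, 2.1]

r = pearson_r(petal_length, petal_width)
print(round(r, 3))   # close to +1: the points lie near a rising straight line
```

Values near +1 or -1 correspond to points that fall close to a straight line in the scatter plot; values near 0 indicate no linear relationship.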
For prediction scatter plots see:
http://en.wikipedia.org/wiki/Scatter_plot
http://en.wikipedia.org/wiki/Correlation (Correlation)
See example for classification, also called supervised scatter
plots, on the next slide
Scatter Plot Array of Iris Attributes
Visualization Techniques: Contour Plots
Contour plots
– Useful when a continuous attribute is measured on a
spatial grid
– They partition the plane into regions of similar values
– The contour lines that form the boundaries of these
regions connect points with equal values
– The most common example is contour maps of
elevation
– Can also display temperature, rainfall, air pressure,
etc.
Contour Plot Example: SST Dec, 1998
[Figure: SST contour plot; temperature in Celsius]
Density Plots
A density plot is a representation of the distribution
of a numeric variable. It uses a kernel density
estimate to show the probability density function of
the variable.
It is a smoothed version of the histogram and is
used for the same purpose.
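A kernel density estimate places a small smooth bump (the kernel) on every data point and averages the bumps. The sketch below uses a Gaussian kernel with a hand-picked bandwidth. Both choices are assumptions made for illustration, and the petal widths are made-up values.

```python
import math

def gaussian_kde(sample, bandwidth):
    """Return a function that estimates the probability density at x."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Average of Gaussian bumps centered on each observation.
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density

# Hypothetical petal widths in cm (illustrative values only).
petal_widths = [0.2, 0.2, 0.3, 0.4, 1.0, 1.3, 1.4, 1.5, 1.8, 2.0, 2.3, 2.5]
density = gaussian_kde(petal_widths, bandwidth=0.3)   # bandwidth chosen by hand

# Evaluate on a coarse grid and draw a text version of the density curve.
for i in range(0, 31, 3):
    x = i / 10
    print(f"{x:4.1f} " + "#" * round(density(x) * 40))
```

The bandwidth plays the same role as the bin width of a histogram: a small bandwidth gives a bumpy curve, a large one a smooth curve.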
https://en.wikipedia.org/wiki/Probability_density_function
http://ggplot2.tidyverse.org/reference/geom_density_2d.html
https://python-graph-gallery.com/2d-density-plot/
Visualization Techniques: Parallel Coordinates
Parallel Coordinates
– Used to plot the attribute values of high-dimensional
data
– Instead of using perpendicular axes, use a set of
parallel axes
– The attribute values of each object are plotted as a
point on each corresponding coordinate axis and the
points are connected by a line
– Thus, each object is represented as a line
– Often, the lines representing a distinct class of objects
group together, at least for some attributes
– Ordering of attributes is important in seeing such
groupings
Parallel Coordinates Plots for Iris Data
Other Visualization Techniques
Star Coordinate Plots
– Similar approach to parallel coordinates, but axes radiate from a
central point
– The line connecting the values of an object is a polygon
Chernoff Faces
– Approach created by Herman Chernoff
– This approach associates each attribute with a characteristic of a
face
– The values of each attribute determine the appearance of the
corresponding facial characteristic
– Each object becomes a separate face
– Relies on humans’ ability to distinguish faces
– http://people.cs.uchicago.edu/~wiseman/chernoff/
– http://kspark.kaist.ac.kr/Human%20Engineering.files/Chernoff/Chernoff%20Faces.htm#
Star Plots for Iris Data
[Figure: star plots for Setosa, Versicolour, and Virginica; axes: sepal length, sepal width, petal length, petal width]
Chernoff Faces for Iris Data
Translation: sepal length → size of face; sepal width → forehead/jaw relative to arc-length;
petal length → shape of forehead; petal width → shape of jaw; width of mouth…; width between eyes…
[Figure: Chernoff faces for Setosa, Versicolour, and Virginica]
Thank you!!!