Chapter Five

Chapter Five discusses descriptive statistics and exploratory data analysis (EDA), focusing on their definitions, types, and techniques. It covers key concepts such as central tendency, variability, and visualization methods including histograms, box plots, and scatter plots. The chapter emphasizes the importance of data representation for understanding data characteristics and relationships.

Uploaded by

samueldagnaw6969
Copyright
© All Rights Reserved
University of Gondar

College of Informatics
Department of Information Systems
Data and Business Analytics
InSy4145

Ibrahim Gashaw (PhD)


Chapter Five
Descriptive
Statistics and Exploratory
Data Analysis

Outline


What are descriptive statistics and Exploratory
Data Analysis (EDA)?

Basic numerical summaries of data

Basic graphical summaries of data
“Central Dogma” of Statistics

[Figure: diagram relating Population and Sample through Probability, Descriptive Statistics, and Inferential Statistics]
Types of Data

• Categorical
– binary: 2 categories
– nominal: more categories
– ordinal: order matters
• Quantitative
– discrete: numerical
– continuous: uninterrupted
What are descriptive statistics?
Descriptive statistics help to describe and understand the features of a specific data set by giving short summaries of the sample and its measures.
In quantitative research, after collecting data, the first step of statistical analysis is to describe characteristics of the responses, such as the average of one variable (e.g., age) or the relation between two variables (e.g., age and creativity).
The next step is inferential statistics, which help you decide whether your data confirm or refute your hypothesis and whether the result is generalizable to a larger population.
Types of descriptive statistics

There are 3 main types of descriptive statistics:

• The distribution concerns the frequency of each value.
• The central tendency concerns the averages of the values.
• The variability or dispersion concerns how spread out the values are.
Types of descriptive statistics...
Numerical Summaries of Data
• Central Tendency measures. They are computed to give a “center” around which the measurements in the data are distributed.
• Variation or Variability measures. They describe “data spread”, or how far away the measurements are from the center.
• Relative Standing measures. They describe the relative position of specific measurements in the data.
Mode
The mode is simply the most popular or most frequent response value. A data set can have no mode, one mode, or more than one mode.

To find the mode, order your data set from lowest to highest and find the response that occurs most frequently.

E.g., mode of the number of library visits
Ordered data set: 0, 3, 3, 12, 15, 24
Mode (the most frequently occurring response): 3
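The mode rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the helper name find_modes is hypothetical, and treating "every value equally frequent" as "no mode" is one common convention that is assumed here.

```python
from statistics import multimode


def find_modes(values):
    """Return all most-frequent values, or [] when every value ties (assumed 'no mode')."""
    modes = multimode(values)
    # Assumption: if all distinct values are equally frequent, report no mode.
    if len(modes) > 1 and len(modes) == len(set(values)):
        return []
    return modes


library_visits = [0, 3, 3, 12, 15, 24]   # data set from the slide
print(find_modes(library_visits))        # one mode
print(find_modes([1, 1, 2, 2, 5]))       # more than one mode
print(find_modes([4, 7, 9]))             # no mode under the assumed convention
```

The three calls correspond to the three cases named on the slide: one mode, more than one mode, and no mode.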

Why Squared Deviations?

• Adding the deviations from the mean always yields a sum of zero, because positive and negative deviations cancel
• Absolute values do not have nice mathematical properties
• Squares eliminate the negatives
• Result:
– Increasing contribution to the variance as you go farther from the mean.
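A short Python sketch can make these points concrete. It reuses the library-visits data from the Mode slide; dividing by n - 1 (the sample variance) is an assumption, since the slides do not say which divisor is intended.

```python
from statistics import mean

visits = [0, 3, 3, 12, 15, 24]          # library visits from the Mode slide
center = mean(visits)                   # 9.5

# Raw deviations cancel out, so their sum carries no information about spread.
deviations = [x - center for x in visits]
sum_of_deviations = sum(deviations)

# Squaring removes the signs and weights far-away points more heavily.
squared = [d ** 2 for d in deviations]
sum_of_squares = sum(squared)

# Assumption: sample variance (divide by n - 1).
sample_variance = sum_of_squares / (len(visits) - 1)

# Share of the total squared deviation contributed by the farthest point (24).
share_of_farthest = max(squared) / sum_of_squares

print(sum_of_deviations, sum_of_squares, sample_variance, round(share_of_farthest, 2))
```

The single value farthest from the mean accounts for roughly half of the total squared deviation, which is the "increasing contribution" effect described above.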
Scale: Standard Deviation

Variance is somewhat arbitrary
• What does it mean to have a variance of 10.8? Or 2.2? Or 1459.092? Or 0.000001?
• Nothing on its own, because variance is expressed in squared units. But if you could “standardize” that value, you could talk about any variance (i.e. deviation) in equivalent terms
• The standard deviation is simply the square root of the variance
Scale: Standard Deviation...

7. Square root – now the value is in the units we started with!
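The earlier steps of this computation did not survive in the slide text, so the following Python sketch is only one plausible reading of the procedure: mean, deviations, squares, average of squares (assuming the n - 1 divisor), then the square root. The function name sample_standard_deviation is hypothetical.

```python
import math
from statistics import mean, stdev


def sample_standard_deviation(values):
    """Standard deviation via the step-by-step route (assumes the n - 1 divisor)."""
    center = mean(values)
    squared_deviations = [(x - center) ** 2 for x in values]
    variance = sum(squared_deviations) / (len(values) - 1)   # squared units
    return math.sqrt(variance)                               # back to original units


visits = [0, 3, 3, 12, 15, 24]          # library visits from the Mode slide
sd = sample_standard_deviation(visits)
print(round(sd, 2))                     # roughly 9.18 visits
print(math.isclose(sd, stdev(visits)))  # agrees with the standard library
```

A variance of 84.3 "squared visits" is hard to interpret, while a standard deviation of about 9.18 visits is on the same scale as the data.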


Percentiles (Quantiles)
Exploratory Data Analysis

Get Data → Exploratory Data Analysis → Preprocessing → Predictive and Descriptive modeling


Techniques Used In Data Exploration

• In EDA
– The focus is on visualization
– Clustering and anomaly detection can be viewed as exploratory techniques
– In data mining, clustering and anomaly detection are major areas of interest, and not thought of as just exploratory

• In our discussion of data exploration, we focus on
1. Summary statistics
2. Visualization
Iris Sample Data Set

• Many of the exploratory data techniques are illustrated with the Iris Plant data set.
– Can be obtained from the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
– From the statistician Ronald A. Fisher
– Three flower types (classes):
• Setosa
• Virginica
• Versicolour
– Four (non-class) attributes
• Sepal width and length
• Petal width and length

[Image: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.]
Visualization

• Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.

• Visualization of data is one of the most powerful and appealing techniques for data exploration.
– Humans have a well-developed ability to analyze large amounts of information that is presented visually
– Can detect general patterns and trends
– Can detect outliers and unusual patterns
Example: Sea Surface Temperature

• The following shows the Sea Surface Temperature (SST) for July 1982
– Tens of thousands of data points are summarized in a single figure
Representation

• Is the mapping of information to a visual format
• Data objects, their attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors.
• Example:
– Objects are often represented as points
– Their attribute values can be represented as the position of the points or the characteristics of the points, e.g., color, size, and shape
– If position is used, then the relationships of points, i.e., whether they form groups or a point is an outlier, are easily perceived.
Example: Visualizing Universities
Visualization Techniques: Histograms

• Histogram
– Usually shows the distribution of values of a single variable
– Divide the values into bins and show a bar plot of the number of objects in each bin.
– The height of each bar indicates the number of objects in that bin
– Shape of histogram depends on the number of bins
• Example: Petal Width (10 and 20 bins, respectively)
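The binning step can be sketched in plain Python. The petal widths below are made-up illustrative values, not the real Iris measurements, and histogram_counts is a hypothetical helper that assumes equal-width bins spanning the minimum to the maximum.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        if width == 0:
            index = 0                      # all values identical: single occupied bin
        else:
            # The maximum sits on the right edge, so clamp it into the last bin.
            index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical petal widths in cm (illustrative only).
petal_widths = [0.2, 0.2, 0.3, 0.4, 1.0, 1.3, 1.4, 1.5, 1.8, 2.0, 2.3, 2.5]

for n_bins in (3, 6):
    counts = histogram_counts(petal_widths, n_bins)
    print(f"{n_bins} bins: {counts}")
    for count in counts:
        print("#" * count)                 # crude text bar, height = count
```

Running it with 3 and 6 bins shows the same data producing differently shaped bar profiles, which is the dependence on the number of bins noted above.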
Two-Dimensional Histograms
• Show the joint distribution of the values of two attributes
• Example: petal width and petal length
– What does this tell us?
Visualization Techniques: Histograms

• Several variations of histograms exist: equi-bin (most popular); other approaches use variable bin sizes…
• Choosing proper bin sizes and bin starting points is a non-trivial problem!
Visualization Techniques: Box Plots

• Box Plots (we do not use the version depicted below!)
– Invented by J. Tukey
– Another way of displaying the distribution of data
– Following figure shows the basic part of a box plot

[Figure: box plot annotated, from top to bottom, with outlier, 90th percentile, 75th percentile, 50th percentile, 25th percentile, and 10th percentile]
Also see: https://en.wikipedia.org/wiki/Box_plot
Boxplots in R (we use those!!)

• By default, boxplot() in R plots the maximum and the minimum non-outlying values instead of the 10th and 90th percentiles as the book describes. Outliers in box plots are values that are more than 1.5*IQR away from the box, where IQR is the height of the box!
See:
> a <- c(11, 12, 22, 33, 34, 100)
> boxplot(a)    # 100 is plotted as an outlier
> b <- c(11, 12, 22, 33, 34, 65)
> boxplot(b)    # 65 is within 1.5*IQR of the box, so no outlier is plotted
> a <- c(11, 12, 22, 33, 34, 50, 100)
> boxplot(a)

[Figure: box plot of a annotated, from top to bottom, with outlier, maximum non-outlier (in place of the 90th percentile), 75th percentile, 50th percentile, 25th percentile, and minimum non-outlier (in place of the 10th percentile); the box height is the IQR, a proxy for standard deviation]


Example of R Box Plots (Mid1 Question)
b) The following boxplot has been created using the following R-code for an attribute x:
> x <- c(1, 2, 2, 2, 4, 4, 8, 9, 9, 10, 18, 22)
> boxplot(x)

[Figure: box plot of x, produced with R version 3.4.3 (Kite-Eating Tree)]

What is the median for the attribute x? What is the IQR for the attribute x? The lower whisker of the boxplot is at 1; what does this tell you? According to the boxplot, 18 is not an outlier and 22 is an outlier; why do you believe this is the case?
Median is 6 = (4+8)/2
IQR = 9.5 - 2 = 7.5
1 is the lowest value in the dataset that is not an outlier. Every value that is more than 1.5*IQR above the 75th percentile or more than 1.5*IQR below the 25th percentile is an outlier; that is, for this particular boxplot, values above 9.5 + 1.5*7.5 = 20.75 and values below 2 - 1.5*7.5 = -9.25 are outliers. Consequently, 22 is an outlier and 1 and 18 are not, and the whiskers are therefore at 1 and 18!
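The arithmetic in this answer can be checked with a small Python sketch. It assumes the box edges are Tukey hinges (the medians of the lower and upper halves of the sorted data), which is consistent with the values 2 and 9.5 used above. The helper names tukey_hinges and boxplot_summary are hypothetical.

```python
from statistics import median


def tukey_hinges(values):
    """Medians of the lower and upper halves (middle value shared when n is odd)."""
    ordered = sorted(values)
    half = (len(ordered) + 1) // 2
    return median(ordered[:half]), median(ordered[-half:])


def boxplot_summary(values, coef=1.5):
    """Median, IQR, fences, whisker ends, and outliers under the 1.5*IQR rule."""
    ordered = sorted(values)
    lower_hinge, upper_hinge = tukey_hinges(ordered)
    iqr = upper_hinge - lower_hinge
    lower_fence = lower_hinge - coef * iqr
    upper_fence = upper_hinge + coef * iqr
    inside = [v for v in ordered if lower_fence <= v <= upper_fence]
    return {
        "median": median(ordered),
        "iqr": iqr,
        "fences": (lower_fence, upper_fence),
        "whiskers": (inside[0], inside[-1]),   # extreme non-outlying values
        "outliers": [v for v in ordered if v < lower_fence or v > upper_fence],
    }


x = [1, 2, 2, 2, 4, 4, 8, 9, 9, 10, 18, 22]
summary = boxplot_summary(x)
for name, value in summary.items():
    print(name, value)
```

For x this reproduces the median of 6, the IQR of 7.5, the fences at -9.25 and 20.75, whiskers at 1 and 18, and 22 as the only outlier.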
Attribute Standardization: Z-scores

• Attribute Standardization/Normalization makes attributes equally important and alleviates the impact of attribute scale
• Z-score standardization:
– Calculate the mean $m_f$ and the standard deviation $s_f$ of attribute $f$
– Calculate the standardized measurement (z-score)
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$

• Result of Z-score standardization is a dataset in which each attribute has a mean of 0 and a standard deviation of 1.
• The obtained attribute values allow for statistical interpretation: e.g. if a person’s z-scored age is -1, her age is one standard deviation below the average age…
• Z-scores can be interpreted based on the 68-95-99.7 Rule!
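A minimal Python sketch of z-score standardization follows. The ages are hypothetical, and using the sample standard deviation (n - 1 divisor) is an assumption, since the slide does not specify the divisor.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize values: subtract the mean, divide by the standard deviation."""
    m = mean(values)
    s = stdev(values)        # assumption: sample standard deviation (n - 1)
    return [(x - m) / s for x in values]


ages = [22, 25, 31, 38, 44]  # hypothetical ages, mean 32
z = z_scores(ages)

print([round(v, 2) for v in z])
print(round(mean(z), 10), round(stdev(z), 10))   # mean 0 and standard deviation 1
```

A negative z-score marks an age below the average and a positive one an age above it, measured in standard deviations.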
[0,1] Attribute Standardization

Approach: Normalize interval-scaled variables using
$$v' = \frac{v - \min(x)}{\max(x) - \min(x)}$$
for each value $v$ of attribute x, where min(x) denotes the minimum value and max(x) denotes the maximum value of attribute x in the data set; that is, all values of the normalized dataset are numbers in [0,1].

Question: if the normalized value of an attribute is 0, what does this mean?
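The [0,1] rescaling can be sketched as follows. The ages are hypothetical, and mapping a constant attribute to all zeros is an assumption made only to avoid dividing by zero.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        # Assumption: a constant attribute is mapped to 0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


ages = [22, 25, 31, 38, 44]          # hypothetical ages
scaled = min_max_normalize(ages)
print([round(v, 3) for v in scaled])
```

In the output, the youngest age maps to 0 and the oldest to 1, with every other age in between.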

Visualization Techniques: Scatter Plots

• Scatter plots
– Attribute values determine the position
– Two-dimensional scatter plots are most common, but three-dimensional scatter plots are also possible
– Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects
– Arrays of scatter plots are useful because they can compactly summarize the relationships of several pairs of attributes
• For prediction scatter plots see:
http://en.wikipedia.org/wiki/Scatter_plot
http://en.wikipedia.org/wiki/Correlation (Correlation)
• See example for classification, also called supervised scatter plots, on the next slide
Scatter Plot Array of Iris Attributes
Visualization Techniques: Contour Plots

• Contour plots
– Useful when a continuous attribute is measured on a
spatial grid
– They partition the plane into regions of similar values
– The contour lines that form the boundaries of these
regions connect points with equal values
– The most common example is contour maps of
elevation
– Can also display temperature, rainfall, air pressure,
etc.
Contour Plot Example: SST Dec, 1998

[Figure: contour plot of sea surface temperature, scale in degrees Celsius]
Density Plots

• A density plot is a representation of the distribution of a numeric variable. It uses a kernel density estimate to show the probability density function of the variable.
• It is a smoothed version of the histogram and is used for the same purpose.
• https://en.wikipedia.org/wiki/Probability_density_function
• http://ggplot2.tidyverse.org/reference/geom_density_2d.html
• https://python-graph-gallery.com/2d-density-plot/
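To show what a kernel density estimate computes, here is a minimal one-dimensional sketch in plain Python. It assumes a Gaussian kernel and a hand-picked bandwidth of 0.3; the petal widths are made-up values and gaussian_kde is a hypothetical helper, not a library function.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x using Gaussian kernels."""
    scale = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Each sample contributes a small bell curve centred on itself.
        return scale * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


# Hypothetical petal widths in cm (illustrative only).
petal_widths = [0.2, 0.2, 0.3, 0.4, 1.0, 1.3, 1.4, 1.5, 1.8, 2.0, 2.3, 2.5]
density = gaussian_kde(petal_widths, bandwidth=0.3)   # bandwidth is an assumption

# Evaluate on a grid; the curve should integrate to approximately 1.
step = 0.01
grid = [-2.0 + i * step for i in range(701)]
area = sum(density(x) for x in grid) * step

for x in (0.0, 0.5, 1.0, 1.5, 2.0, 2.5):
    print(x, round(density(x), 3))
```

The bandwidth plays the role that the bin width plays in a histogram: smaller values give a bumpier curve and larger values a smoother one.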
Density Plots
Visualization Techniques: Parallel Coordinates

• Parallel Coordinates
– Used to plot the attribute values of high-dimensional
data
– Instead of using perpendicular axes, use a set of
parallel axes
– The attribute values of each object are plotted as a
point on each corresponding coordinate axis and the
points are connected by a line
– Thus, each object is represented as a line
– Often, the lines representing a distinct class of objects
group together, at least for some attributes
– Ordering of attributes is important in seeing such
groupings
Parallel Coordinates Plots for Iris Data
Other Visualization Techniques

• Star Coordinate Plots
– Similar approach to parallel coordinates, but axes radiate from a central point
– The line connecting the values of an object is a polygon
• Chernoff Faces
– Approach created by Herman Chernoff
– This approach associates each attribute with a characteristic of a face
– The values of each attribute determine the appearance of the corresponding facial characteristic
– Each object becomes a separate face
– Relies on humans’ ability to distinguish faces
– http://people.cs.uchicago.edu/~wiseman/chernoff/
– http://kspark.kaist.ac.kr/Human%20Engineering.files/Chernoff/Chernoff%20Faces.htm#
Star Plots for Iris Data

[Figure: star plots for Setosa, Versicolour, and Virginica, with axes for sepal length, sepal width, petal length, and petal width]
Chernoff Faces for Iris Data
Translation: sepal length → size of face; sepal width → forehead/jaw relative arc length; petal length → shape of forehead; petal width → shape of jaw; width of mouth…; width between eyes…

[Figure: Chernoff faces for Setosa, Versicolour, and Virginica]
Thank you!!!
