Chapter Five

Chapter Five discusses descriptive statistics and exploratory data analysis (EDA), focusing on their definitions, types, and techniques. It covers key concepts such as central tendency, variability, and visualization methods including histograms, box plots, and scatter plots. The chapter emphasizes the importance of data representation for understanding data characteristics and relationships.

University of Gondar

College of Informatics
Department of Information Systems
Data and Business Analytics
InSy4145

Ibrahim Gashaw (PhD)


Chapter Five: Descriptive Statistics and Exploratory Data Analysis

Outline

• What are descriptive statistics and Exploratory Data Analysis (EDA)?
• Basic numerical summaries of data
• Basic graphical summaries of data
“Central Dogma” of Statistics

[Figure: diagram relating a Population and a Sample, with the labels Probability, Descriptive Statistics, and Inferential Statistics]
Types of Data

• Categorical
– binary: 2 categories
– nominal: more categories
– ordinal: order matters
• Quantitative
– discrete: numerical
– continuous: uninterrupted
What is descriptive statistics
• Descriptive statistics help to describe and understand the features of a specific data set by giving short summaries about the sample and measures of the data.
• In quantitative research, after collecting data, the first step of statistical analysis is to describe characteristics of the responses, such as the average of one variable (e.g., age), or the relation between two variables (e.g., age and creativity).
• The next step is inferential statistics, which help you decide whether your data confirm or refute your hypothesis and whether it is generalizable to a larger population.
Types of descriptive statistics

There are 3 main types of descriptive statistics:

• The distribution concerns the frequency of each value.
• The central tendency concerns the averages of the values.
• The variability or dispersion concerns how spread out the values are.
Types of descriptive statistics...
Numerical Summaries of Data
• Central Tendency measures. They are computed to give a “center” around which the measurements in the data are distributed.
• Variation or Variability measures. They describe “data spread” or how far away the measurements are from the center.
• Relative Standing measures. They describe the relative position of specific measurements in the data.
Mode
The mode is simply the most popular or most frequent response value. A data set can have no mode, one mode, or more than one mode.

To find the mode, order your data set from lowest to highest and find the response that occurs most frequently.

E.g., mode of the number of library visits:
Ordered data set: 0, 3, 3, 12, 15, 24
Mode (the most frequently occurring response): 3
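The procedure above can be sketched in R using the library-visits values from the example. Base R has no built-in function for the statistical mode, so the helper name find_modes is an assumption introduced here purely for illustration.

```r
# Hypothetical helper (not part of base R): return the most frequent value(s).
find_modes <- function(x) {
  counts <- table(x)                                   # frequency of each distinct value
  as.numeric(names(counts)[counts == max(counts)])     # keep the value(s) with the top count
}

visits <- c(0, 3, 3, 12, 15, 24)   # ordered data set from the example
find_modes(visits)                 # 3
```

If every value occurs exactly once, this sketch returns all values, which corresponds to the "no mode" case; a caller would need to check for that.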

Why Squared Deviations?

• Adding deviations will yield a sum of ?

• Absolute values do not have nice mathematical
properties
• Squares eliminate the negatives
• Result:
– Increasing contribution to the variance as you go
farther from the mean.
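As a small numeric illustration of these points, the sketch below reuses the library-visits values from the mode example (an assumption; the slide itself gives no data here) and compares raw, absolute, and squared deviations.

```r
visits <- c(0, 3, 3, 12, 15, 24)

deviations <- visits - mean(visits)   # the mean is 9.5
sum(deviations)                       # 0: positive and negative deviations cancel
sum(abs(deviations))                  # 45
squared <- deviations^2               # squaring removes the negative signs
sum(squared)                          # 421.5
squared                               # the value farthest from the mean (24) contributes 210.25
```

The last line shows the growing contribution: the single observation at 24 accounts for about half of the total squared deviation.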
Scale: Standard Deviation

Variance is somewhat arbitrary
• What does it mean to have a variance of 10.8? Or
2.2? Or 1459.092? Or 0.000001?
• Nothing. But if you could “standardize” that value,
you could talk about any variance (i.e. deviation) in
equivalent terms
• Standard deviations are simply the square root of
the variance
Scale: Standard Deviation...

7. Square root – now the value is in the units we started with!!!
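A minimal sketch of the final step in R, again on the library-visits values (an assumption, since the earlier steps of this slide are not available). It uses the sample formula with n - 1 in the denominator, which is what R's var() and sd() compute; the slide does not say which denominator it intends.

```r
visits <- c(0, 3, 3, 12, 15, 24)
n <- length(visits)

variance <- sum((visits - mean(visits))^2) / (n - 1)   # 84.3, in squared visits
std_dev  <- sqrt(variance)                             # about 9.18, back in visits

all.equal(variance, var(visits))   # TRUE
all.equal(std_dev, sd(visits))     # TRUE
```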

Percentiles (Quantiles)
Exploratory Data Analysis

[Figure: workflow with the stages Get Data, Exploratory Data Analysis, Preprocessing, and Predictive and Descriptive Modeling]

Techniques Used In Data Exploration

• In EDA
– The focus is on visualization
– Clustering and anomaly detection can be viewed as exploratory techniques
– In data mining, clustering and anomaly detection are major areas of interest, and not thought of as just exploratory

• In our discussion of data exploration, we focus on
1. Summary statistics
2. Visualization
Iris Sample Data Set

• Many of the exploratory data techniques are illustrated with the Iris Plant data set.
– Can be obtained from the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
– From the statistician R. A. Fisher
– Three flower types (classes):
• Setosa
• Virginica
• Versicolour
– Four (non-class) attributes
• Sepal width and length
• Petal width and length
[Figure: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.]
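For readers following along in R, a copy of this data set ships with R itself as the data frame iris, so no download is needed. The sketch below only inspects it; note that R spells the third class "versicolor".

```r
data(iris)                # built-in copy in R's datasets package

str(iris)                 # 150 observations of 5 variables
table(iris$Species)       # 50 flowers in each of the three classes
summary(iris[, 1:4])      # minimum, quartiles, median, and mean of each attribute
```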
Visualization

• Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.

• Visualization of data is one of the most powerful and appealing techniques for data exploration.
– Humans have a well-developed ability to analyze large amounts of information that is presented visually
– Can detect general patterns and trends
– Can detect outliers and unusual patterns
Example: Sea Surface Temperature

• The following shows the Sea Surface Temperature (SST) for July 1982
– Tens of thousands of data points are summarized in a single figure
Representation

• Is the mapping of information to a visual format
• Data objects, their attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors.
• Example:
– Objects are often represented as points
– Their attribute values can be represented as the position of the points or the characteristics of the points, e.g., color, size, and shape
– If position is used, then the relationships of points, i.e., whether they form groups or a point is an outlier, are easily perceived.
Example: Visualizing Universities
Visualization Techniques: Histograms

• Histogram
– Usually shows the distribution of values of a single variable
– Divide the values into bins and show a bar plot of the number of objects in each bin.
– The height of each bar indicates the number of objects
– Shape of histogram depends on the number of bins
• Example: Petal Width (10 and 20 bins, respectively)
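The petal-width example can be sketched in R with the built-in iris data. Passing a single number to the breaks argument of hist() is only a suggestion to R, so this sketch builds explicit equal-width bin edges to get exactly 10 and 20 bins; the variable names are assumptions chosen for illustration.

```r
pw <- iris$Petal.Width

# Equal-width bin edges spanning the observed range (10 and 20 bins).
edges10 <- seq(min(pw), max(pw), length.out = 11)
edges20 <- seq(min(pw), max(pw), length.out = 21)

op <- par(mfrow = c(1, 2))   # two plots side by side
h10 <- hist(pw, breaks = edges10, main = "10 bins", xlab = "Petal width (cm)")
h20 <- hist(pw, breaks = edges20, main = "20 bins", xlab = "Petal width (cm)")
par(op)                      # restore the previous plot layout

h10$counts                   # number of flowers in each of the 10 bins
```

Comparing the two panels shows how the apparent shape of the distribution changes with the number of bins.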
Two-Dimensional Histograms
• Show the joint distribution of the values of two attributes
• Example: petal width and petal length
– What does this tell us?
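The counts behind a two-dimensional histogram can be sketched in R by binning each attribute and cross-tabulating the bins. The choice of 5 equal-width bins per attribute is an assumption made for this illustration.

```r
# Bin each attribute into 5 equal-width intervals over its observed range.
length_bin <- cut(iris$Petal.Length, breaks = 5)
width_bin  <- cut(iris$Petal.Width,  breaks = 5)

# Joint counts: rows are petal-length bins, columns are petal-width bins.
joint <- table(length_bin, width_bin)
joint
```

Cells far from the diagonal that stay empty or nearly empty would suggest that the two attributes increase together.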
Visualization Techniques: Histograms

• Several variations of histograms exist: equal-width bins (most popular); other approaches use variable bin sizes…
• Choosing proper bin sizes and bin starting points is a nontrivial problem!
Visualization Techniques: Box Plots

• Box Plots (we do not use the version depicted below!)
– Invented by J. Tukey
– Another way of displaying the distribution of data
– Following figure shows the basic part of a box plot
[Figure: box plot with labels, from top to bottom: outlier, 90th percentile, 75th percentile, 50th percentile, 25th percentile, 10th percentile]
Also see: https://en.wikipedia.org/wiki/Box_plot
Boxplots in R (we use those!!)

By default, boxplot() in R draws the whiskers at the maximum and the minimum non-outlying values instead of at the 10th and 90th percentiles that the book describes. Outliers in box plots are values that lie more than 1.5*IQR away from the box, where IQR is the height of the box.
See:
> # 100 is drawn as an outlier
> a <- c(11, 12, 22, 33, 34, 100)
> boxplot(a)
> # 65 is not an outlier, so the upper whisker extends to it
> b <- c(11, 12, 22, 33, 34, 65)
> boxplot(b)
> a <- c(11, 12, 22, 33, 34, 50, 100)
> boxplot(a)
[Figure: R box plot of a with labels, from top to bottom: outlier, maximum non-outlier (in place of the 90th percentile), 75th percentile, 50th percentile, 25th percentile, minimum non-outlier (in place of the 10th percentile); the height of the box is the IQR (a proxy for standard deviation)]
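The outlier rule can be checked numerically for the first two vectors above. This is a sketch: the helper name upper_fence is an assumption introduced here, while fivenum() and boxplot.stats() are the base R functions that boxplot() relies on.

```r
a <- c(11, 12, 22, 33, 34, 100)
b <- c(11, 12, 22, 33, 34, 65)

# Hypothetical helper: upper edge of the box plus 1.5 times the box height.
upper_fence <- function(x) {
  h <- fivenum(x)               # minimum, lower hinge, median, upper hinge, maximum
  h[4] + 1.5 * (h[4] - h[2])
}

upper_fence(a)           # 67, so 100 lies beyond the fence
upper_fence(b)           # 67, so 65 does not
boxplot.stats(a)$out     # 100
boxplot.stats(b)$out     # numeric(0): no outliers
```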

Example of R Box Plots (Mid1 Question)
b) The following boxplot has been created using the following R code for an attribute x:
> x <- c(1, 2, 2, 2, 4, 4, 8, 9, 9, 10, 18, 22)
> boxplot(x)

R version 3.4.3: Kite-Eating Tree

What is the median for the attribute x? What is the IQR for the attribute x? The lower whisker of the boxplot is at 1; what does this tell you? According to the boxplot, 18 is not an outlier and 22 is an outlier; why do you believe this is the case?
Median is 6 = (4+8)/2
IQR = 9.5 - 2 = 7.5
1 is the lowest value in the dataset that is not an outlier. Every value that is more than 1.5*IQR above the 75th percentile or more than 1.5*IQR below the 25th percentile is an outlier; for this particular boxplot, values above 9.5 + 1.5*7.5 = 20.75 and values below 2 - 1.5*7.5 = -9.25 are outliers. Consequently, 22 is an outlier while 1 and 18 are not, and the whiskers are therefore at 1 and 18.
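The answer can be verified with boxplot.stats(), which returns the five numbers that boxplot() draws. One caveat worth knowing: the box edges are Tukey's hinges, so the box height can differ slightly from what the IQR() function reports with its default quantile method.

```r
x <- c(1, 2, 2, 2, 4, 4, 8, 9, 9, 10, 18, 22)

s <- boxplot.stats(x)
s$stats     # 1.0 2.0 6.0 9.5 18.0: lower whisker, box bottom, median, box top, upper whisker
s$out       # 22: the only point drawn as an outlier

box_height <- s$stats[4] - s$stats[2]   # 7.5, the IQR used in the answer above
IQR(x)                                  # 7.25, from the default quantile definition
```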
Attribute Standardization: Z-scores

• Attribute standardization/normalization makes attributes equally important and alleviates the impact of attribute scale
• Z-score standardization:
– Calculate the mean $m_f$ and the standard deviation $s_f$ of attribute f
– Calculate the standardized measurement (z-score):
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
• The result of z-score standardization is a dataset in which each attribute has a mean of 0 and a standard deviation of 1.
• The obtained attribute values allow for statistical interpretation: e.g., if a person’s z-scored age is -1, her age is one standard deviation below the average age…
• Z-scores can be interpreted based on the 68-95-99.7 rule!
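A minimal sketch of z-score standardization in R, using sepal length from the built-in iris data as the attribute (an assumption; the slide does not name a data set here). The base R function scale() performs the same computation column by column.

```r
x <- iris$Sepal.Length

z <- (x - mean(x)) / sd(x)   # z-score of every measurement
mean(z)                      # 0, up to floating-point rounding
sd(z)                        # 1

# scale() standardizes all four numeric attributes at once.
z_all <- scale(iris[, 1:4])

# Share of flowers within one standard deviation of the mean sepal length.
mean(abs(z) <= 1)
```

The 68-95-99.7 rule describes normally distributed data, so the last share will only be close to 0.68 when the attribute is roughly normal.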
[0,1] Attribute Standardization

Approach: Normalize interval-scaled variables using

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

where min(x) denotes the minimum value and max(x) denotes the maximum value of attribute x in the data set; that is, all values of the normalized dataset are numbers in [0,1].

Question: if the normalized value of an attribute is 0, what does this mean?
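A short R sketch of [0,1] normalization may help with the question. The helper name normalize01 and the age values are made up for illustration and do not come from the slides.

```r
# Hypothetical helper: rescale a numeric vector to the interval [0, 1].
normalize01 <- function(x) (x - min(x)) / (max(x) - min(x))

age <- c(18, 25, 40, 62)    # made-up example values
scaled_age <- normalize01(age)
scaled_age                  # 0.000 0.159 0.500 1.000 (rounded)

age[scaled_age == 0]        # 18: the value that maps to 0
```

The sketch assumes the attribute is not constant; if max(x) equals min(x), the denominator is zero.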

Visualization Techniques: Scatter Plots

• Scatter plots
– Attribute values determine the position
– Two-dimensional scatter plots are most common, but three-dimensional scatter plots are also possible
– Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects
– Arrays of scatter plots are useful because they can compactly summarize the relationships of several pairs of attributes
• For prediction scatter plots see:
http://en.wikipedia.org/wiki/Scatter_plot
http://en.wikipedia.org/wiki/Correlation (Correlation)
• See the example for classification scatter plots, also called supervised scatter plots, on the next slide
Scatter Plot Array of Iris Attributes
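A scatter plot array of this kind can be sketched in R with pairs() on the built-in iris data. The color choices are assumptions made for illustration; coloring by class is what makes it a supervised scatter plot.

```r
# One color per flower, chosen by its class.
species_colors <- c("red", "green3", "blue")[as.integer(iris$Species)]

# One panel for every pair of the four attributes.
pairs(iris[, 1:4], col = species_colors, pch = 19,
      main = "Scatter plot array of iris attributes")

# Numeric companion to the petal length vs. petal width panel.
cor(iris$Petal.Length, iris$Petal.Width)   # about 0.96
```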
Visualization Techniques: Contour Plots

• Contour plots
– Useful when a continuous attribute is measured on a spatial grid
– They partition the plane into regions of similar values
– The contour lines that form the boundaries of these regions connect points with equal values
– The most common example is contour maps of elevation
– Can also display temperature, rainfall, air pressure, etc.
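As an elevation example, R ships with the volcano matrix, a grid of heights for the Maunga Whau volcano. The sketch below uses it in place of the temperature data shown on the slides, which is not available here.

```r
data(volcano)     # built-in matrix of elevations on a regular grid
dim(volcano)      # 87 61

# Each contour line connects grid positions with the same elevation.
contour(volcano, nlevels = 10,
        main = "Contour map of elevation (volcano data)")
```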
Contour Plot Example: SST Dec, 1998

[Figure: contour plot of sea surface temperature, in degrees Celsius]
Density Plots

• A density plot is a representation of the distribution of a numeric variable. It uses a kernel density estimate to show the probability density function of the variable.
• It is a smoothed version of the histogram and is used for the same purpose.
• https://en.wikipedia.org/wiki/Probability_density_function
• http://ggplot2.tidyverse.org/reference/geom_density_2d.html
• https://python-graph-gallery.com/2d-density-plot/
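To see the "smoothed histogram" idea, the sketch below overlays a kernel density estimate on a histogram of iris petal width in base R. The data choice and the default bandwidth of density() are assumptions for illustration.

```r
pw <- iris$Petal.Width

# Histogram on the density scale, so the bar areas sum to 1.
hist(pw, freq = FALSE, breaks = 20,
     main = "Petal width: histogram and density", xlab = "Petal width (cm)")

# Kernel density estimate drawn over the histogram.
d <- density(pw)
lines(d, lwd = 2)
```

Changing the bw argument of density() plays the same role as changing the bin width of the histogram.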
Density Plots
Visualization Techniques: Parallel Coordinates

• Parallel Coordinates
– Used to plot the attribute values of high-dimensional data
– Instead of using perpendicular axes, use a set of parallel axes
– The attribute values of each object are plotted as a point on each corresponding coordinate axis and the points are connected by a line
– Thus, each object is represented as a line
– Often, the lines representing a distinct class of objects group together, at least for some attributes
– Ordering of attributes is important in seeing such groupings
Parallel Coordinates Plots for Iris Data
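A parallel coordinates plot of the iris data can be sketched in base R with matplot(). Rescaling every attribute to [0,1] so the axes are comparable, and the colors used, are assumptions of this sketch rather than something the slides specify.

```r
# Rescale each attribute to [0, 1]; the result is a 150 x 4 matrix.
scaled <- sapply(iris[, 1:4], function(v) (v - min(v)) / (max(v) - min(v)))
line_colors <- c("red", "green3", "blue")[as.integer(iris$Species)]

# One line per flower, crossing four parallel axes at positions 1 to 4.
matplot(1:4, t(scaled), type = "l", lty = 1, col = line_colors,
        xaxt = "n", xlab = "", ylab = "Scaled value")
axis(1, at = 1:4, labels = colnames(scaled))
legend("top", legend = levels(iris$Species),
       col = c("red", "green3", "blue"), lty = 1)
```

Reordering the columns of scaled changes which attributes sit next to each other, which is how the effect of attribute ordering can be explored.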
Other Visualization Techniques

• Star Coordinate Plots
– Similar approach to parallel coordinates, but axes radiate from a central point
– The line connecting the values of an object is a polygon
• Chernoff Faces
– Approach created by Herman Chernoff
– This approach associates each attribute with a characteristic of a face
– The values of each attribute determine the appearance of the corresponding facial characteristic
– Each object becomes a separate face
– Relies on humans’ ability to distinguish faces
– http://people.cs.uchicago.edu/~wiseman/chernoff/
– http://kspark.kaist.ac.kr/Human%20Engineering.files/Chernoff/Chernoff%20Faces.htm#
Star Plots for Iris Data

[Figure: star plots for Setosa, Versicolour, and Virginica flowers; axes: sepal length, sepal width, petal length, petal width]
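Star plots like these can be sketched with the base R function stars(). Picking the first flower of each class and rescaling each attribute to [0,1] beforehand are assumptions made for this illustration.

```r
# Rescale each attribute to [0, 1] over the whole data set.
scaled <- sapply(iris[, 1:4], function(v) (v - min(v)) / (max(v) - min(v)))

# First flower of each class in R's ordering of the iris data.
picks <- c(1, 51, 101)
pick_labels <- as.character(iris$Species[picks])

# One star per flower; each ray's length is the scaled attribute value.
stars(scaled[picks, ], scale = FALSE, labels = pick_labels,
      main = "Star plots for three iris flowers")
```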
Chernoff Faces for Iris Data
Translation: sepal length → size of face; sepal width → forehead/jaw relative arc length; petal length → shape of forehead; petal width → shape of jaw; width of mouth…; width between eyes…

[Figure: Chernoff faces for Setosa, Versicolour, and Virginica flowers]
Thank you!!!
