Descriptive Statistics
Dr. Lázaro Bustio Martínez
[email protected]
Fall 2023
Agenda
• Elements of descriptive statistics
• Exploratory data analysis
• Objective: identify the elements of descriptive statistics and apply them to a small dataset.
Introduction
• In all research, and before drawing conclusions about the proposed objectives and hypotheses, it is necessary to carry out a preliminary, exploratory analysis of the data in order to detect errors in the coding of the variables, eliminate inconsistencies, evaluate the magnitude and type of missing data, learn about the basic characteristics of the distribution of the variables (normality, equality of variances, presence of outliers, linearity, etc.), and gain a first view of the relationships between them.
Introduction
• Most of these objectives are achieved by performing a descriptive analysis of the variables. Specifically, measures of central tendency and dispersion are used to describe the characteristics of quantitative variables, and tables of frequencies and percentages are used for qualitative variables.

Variable type   Analytical indices                                          Graphical representations
Quantitative    mean, median, mode, standard deviation, range,              histogram, box plot
                interquartile range, normality test
Qualitative     frequencies, percentages, mode, etc.                        bar chart, line chart, pie chart
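As a minimal, illustrative sketch of these indices in Python (the DataFrame, its column names, and the values below are hypothetical, not taken from the course material):

```python
import pandas as pd
from scipy import stats

# Hypothetical example data; column names and values are illustrative only.
df = pd.DataFrame({
    "income": [23.1, 19.8, 30.5, 27.2, 22.9, 35.4, 28.8, 22.9],                        # quantitative
    "region": ["north", "south", "north", "east", "south", "north", "east", "south"],  # qualitative
})

# Quantitative variable: central tendency and dispersion
print(df["income"].mean(), df["income"].median(), df["income"].mode().iloc[0])
print(df["income"].std())                                          # standard deviation
print(df["income"].max() - df["income"].min())                     # range
print(df["income"].quantile(0.75) - df["income"].quantile(0.25))   # interquartile range
print(stats.shapiro(df["income"]))                                 # Shapiro-Wilk normality test

# Qualitative variable: frequencies and percentages
print(df["region"].value_counts())
print(df["region"].value_counts(normalize=True) * 100)
```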
Exploratory Data Analysis (1977)
• Based on insights developed at Bell Labs in the 60’s.
• Technique for visualizing and summarizing data.
• What can the data tell us? (in contrast to “confirmatory” data analysis)
• Introduced many basic techniques:
  • 5-number summary, box plots, stem and leaf diagrams.
• 5-number summary:
  • Extremes (min and max)
  • Median and quartiles
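A minimal sketch of computing the 5-number summary (the sample values are made up):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([3, 7, 8, 5, 12, 14, 21, 13, 18])   # made-up sample

# 5-number summary: min, Q1, median, Q3, max
five_num = np.percentile(x, [0, 25, 50, 75, 100])
print(five_num)   # [ 3.  7. 12. 14. 21.]

# A box plot is drawn from exactly these five numbers (plus outlier rules)
plt.boxplot(x)
plt.show()
```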
Aim of the EDA
1. Maximize insight into a dataset
2. Uncover underlying structure
3. Extract important variables
4. Detect outliers and anomalies
5. Test underlying assumptions
6. Develop valid models
7. Determine optimal factor settings (Xs)
Aim of the EDA
• The goal of EDA is to open-mindedly explore data.
• Tukey: EDA is detective work… Unless detective finds the clues, judge or jury
has nothing to consider.
• Here, the judge or jury is confirmatory data analysis.
• Tukey: Confirmatory data analysis goes further, assessing the strengths of the
evidence.
• With EDA, we can examine the data and try to understand the meaning of the
variables, e.g., what the abbreviations stand for.
Exploratory vs Confirmatory Data Analysis

EDA:
• No hypothesis at first
• Generate hypothesis
• Uses graphical methods (mostly)
• Descriptive statistics
• Graphical
• Data driven

CDA:
• Start with hypothesis
• Test the null hypothesis
• Uses statistical models
• Inferential statistics
• EDA and theory driven
Pipeline of EDA
1. Generate good research questions.
2. Data restructuring: you may need to make new variables from the existing ones, e.g., instead of using two variables, obtain rates or percentages from them, or create dummy variables for categorical variables (see the sketch after this list).
3. Based on the research questions, use appropriate graphical tools and obtain descriptive statistics. Try to understand the data structure, relationships, anomalies, and unexpected behaviors.
4. Try to identify confounding variables, interaction relations, and multicollinearity, if any.
5. Handle missing observations.
6. Decide on the need for transformation (of the response and/or explanatory variables).
7. Decide on the hypothesis based on your research questions.
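As referenced in step 2, a minimal sketch of deriving rates and dummy variables with pandas (the DataFrame and its column names are hypothetical):

```python
import pandas as pd

# Hypothetical data; columns are illustrative only.
df = pd.DataFrame({
    "city": ["A", "B", "C"],
    "crimes": [120, 45, 300],
    "population": [10_000, 2_500, 40_000],
    "region": ["north", "south", "north"],
})

# A rate instead of two raw counts
df["crime_rate"] = df["crimes"] / df["population"]

# Dummy (indicator) variables for a categorical variable
df = pd.concat([df, pd.get_dummies(df["region"], prefix="region")], axis=1)
print(df)
```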
After EDA
• Confirmatory data analysis: verify the hypothesis by statistical analysis.
• Get conclusions and present your results nicely.
Classification of EDA*
• Exploratory data analysis is generally cross-classified in two ways. First, each method is
either non-graphical or graphical. And second, each method is either univariate or
multivariate (usually just bivariate).
• Non-graphical methods generally involve calculation of summary statistics, while graphical
methods obviously summarize the data in a diagrammatic or pictorial way.
• Univariate methods look at one variable (data column) at a time, while multivariate
methods look at two or more variables at a time to explore relationships. Usually, our
multivariate EDA will be bivariate (looking at exactly two variables), but occasionally it will
involve three or more variables.
• It is almost always a good idea to perform univariate EDA on each of the components of a
multivariate EDA before performing the multivariate EDA.
*Seltman, H.J. (2015). Experimental Design and Analysis. http://www.stat.cmu.edu/~hseltman/309/Book/Book.pdf
Exploratory data analysis tools
• Specific statistical functions and techniques you can perform with EDA tools include:
• Clustering and dimension reduction techniques, which help create graphical displays of high-
dimensional data containing many variables.
• Univariate visualization of each field in the raw dataset, with summary statistics.
• Bivariate visualizations and summary statistics that allow you to assess the relationship between
each variable in the dataset and the target variable you’re looking at.
• Multivariate visualizations, for mapping and understanding interactions between different fields in
the data.
• K-means Clustering is a clustering method in unsupervised learning where data points are assigned
into K groups, i.e., the number of clusters, based on the distance from each group’s centroid. The
data points closest to a particular centroid will be clustered under the same category. K-means
Clustering is commonly used in market segmentation, pattern recognition, and image compression.
• Predictive models, such as linear regression, use statistics and data to predict outcomes.
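For example, a minimal linear-regression sketch with made-up numbers (scikit-learn is just one possible choice of library):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up predictor and outcome values
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # shape (n_samples, n_features)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)       # fitted slope and intercept
print(model.predict(np.array([[6.0]])))    # predicted outcome for a new observation
```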
Types of exploratory data analysis
• Other common types of multivariate graphics include:
• Scatter plot, which is used to plot data points on a horizontal and a vertical axis to
show how much one variable is affected by another.
• Multivariate chart, which is a graphical representation of the relationships between
factors and a response.
• Run chart, which is a line graph of data plotted over time.
• Bubble chart, which is a data visualization that displays multiple circles (bubbles) in a
two-dimensional plot.
• Heat map, which is a graphical representation of data where values are depicted by
color.
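A minimal matplotlib sketch of two of these graphics, using random data (all names and values are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)
data = np.column_stack([x, y, rng.normal(size=100)])

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot: how much one variable is affected by another
axes[0].scatter(x, y)
axes[0].set_xlabel("x")
axes[0].set_ylabel("y")

# Heat map: values (here, a correlation matrix) depicted by color
im = axes[1].imshow(np.corrcoef(data, rowvar=False), cmap="viridis")
fig.colorbar(im, ax=axes[1])
plt.show()
```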
Example
• Data from the Places Rated Almanac (Boyer and Savageau, 1985): 9 variables from 329 metropolitan areas in the USA:
  • Climate mildness
  • Housing cost
  • Health care and environment
  • Crime
  • Transportation supply
  • Educational opportunities and effort
  • Arts and culture facilities
  • Recreational opportunities
  • Personal economic outlook
  • + latitude and longitude of each city

Questions:
1. How is climate related to location?
2. Are there clusters in the data (excluding location)?
3. Are nearby cities similar?
4. Any relation between economic outlook and crime?
5. What else???
What is data?
• Categorical (Qualitative)
• Nominal scales – number is just a symbol that identifies a quality
• 0=male, 1=female
• 1=green, 2=blue, 3=red, 4=white
• Ordinal – rank order
• Quantitative (continuous and discrete)
• Interval – units are of identical size (e.g., years)
• Ratio – distance from an absolute zero (e.g., age, reaction time)
What is a measurement?
• Every measurement has 2 parts:
• The True Score (the actual state of things in the world)
• and
• ERROR! (mistakes, bad measurement, report bias, context effects, etc.)
• X=T+e
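A tiny simulation of this model (the true score and the error distribution below are arbitrary choices, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(42)
true_score = 100.0                               # T: the actual state of the world
error = rng.normal(loc=0.0, scale=5.0, size=20)  # e: measurement error
observed = true_score + error                    # X = T + e

print(observed.mean())   # with more measurements, the mean gets closer to T
```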
Organizing your data in a spreadsheet

• Stacked data: multiple cases (rows) for each subject

Subject  condition  score
1        before     3
1        during     2
1        after      5
2        before     3
2        during     8
2        after      4
3        before     3
3        during     7
3        after      1

• Unstacked data: only one case (row) per subject

Subject  before  during  after
1        3       2       5
2        3       8       4
3        3       7       1
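A minimal pandas sketch for converting between the two layouts shown above:

```python
import pandas as pd

unstacked = pd.DataFrame({
    "subject": [1, 2, 3],
    "before": [3, 3, 3],
    "during": [2, 8, 7],
    "after":  [5, 4, 1],
})

# Unstacked -> stacked: one row per subject/condition pair
stacked = unstacked.melt(id_vars="subject", var_name="condition", value_name="score")

# Stacked -> unstacked: one row per subject
back = stacked.pivot(index="subject", columns="condition", values="score")
print(stacked)
print(back)
```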
Variable Summaries
Indices of central tendency:
• Mean – the average value
• Median – the middle value
• Mode – the most frequent value
Indices of Variability:
• Variance – the spread around the mean
• Standard deviation
• Standard error of the mean (estimate)
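A minimal pandas sketch of these indices, using the nine scores from the small example above:

```python
import pandas as pd

scores = pd.Series([3, 2, 5, 3, 8, 4, 3, 7, 1])   # the nine scores from the example above

print(scores.mean())     # average value
print(scores.median())   # middle value
print(scores.mode())     # most frequent value(s)
print(scores.var())      # spread around the mean (sample variance, divides by n-1)
print(scores.std())      # standard deviation
print(scores.sem())      # standard error of the mean
```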
The Mean

Subject  before  during  after
1        3       2       7
2        3       8       4
3        3       7       3
4        3       2       6
5        3       8       4
6        3       1       6
7        3       9       3
8        3       3       6
9        3       9       4
10       3       1       7
Sum =    30      50      50
n =      10      10      10
Mean =   3       5       5

• Mean = sum of all scores divided by the number of scores:
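In symbols (the standard formula this slide refers to):

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

For the "during" column, for example: (2 + 8 + 7 + 2 + 8 + 1 + 9 + 3 + 9 + 1) / 10 = 50 / 10 = 5.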
The Variance: sum of the squared deviations divided by the number of scores
• In probability theory and statistics, variance is the
expectation of the squared deviation of a random
variable from its mean. Variance is a measure of
dispersion, meaning it is a measure of how far a set
of numbers is spread out from their average value.
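In symbols (dividing by n, as the worked example on the next slide does):

$$\operatorname{Var}(x) = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$

For the "during" column, for example, the sum of squared deviations is 108, so the variance is 108 / 10 = 10.8.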
The Variance: sum of the squared deviations divided by the number of scores
(b-mean = before − mean of before, and similarly for during and after)

Subject  before  during  after   b-mean  b-mean²  d-mean  d-mean²  a-mean  a-mean²
1        3       2       7       0       0        -3      9        2       4
2        3       8       4       0       0        3       9        -1      1
3        3       7       3       0       0        2       4        -2      4
4        3       2       6       0       0        -3      9        1       1
5        3       8       4       0       0        3       9        -1      1
6        3       1       6       0       0        -4      16       1       1
7        3       9       3       0       0        4       16       -2      4
8        3       3       6       0       0        -2      4        1       1
9        3       9       4       0       0        4       16       -1      1
10       3       1       7       0       0        -4      16       2       4
Sum =    30      50      50      0       0        0       108      0       22
n =      10      10      10              10               10               10
Mean =   3       5       5
VAR =                                    0                10.8             2.2
Variance continued
[Figure: three panels plotting the before, during, and after scores for subjects 1-10, each with the group mean overlaid; the scores spread around the mean by a different amount in each panel.]
Distribution
• Means and variances are ways to describe a distribution of scores.
• Knowing about your distributions is one of the best ways to understand your data.
• A NORMAL (aka Gaussian) distribution is the most common assumption of statistics; thus, it is often important to check if your data are normally distributed.
What is “normal” anyway?
• With enough measurements, most variables are distributed normally.
• But in order to fully describe data we need to introduce the idea of a standard deviation.
[Figure: a normal curve compared with leptokurtic (more peaked) and platykurtic (flatter) distributions.]
Standard deviation
• Variance, as calculated earlier, is arbitrary.
• What does it mean to have a variance of 10.8? Or 2.2? Or 1459.092? Or 0.000001?
• Nothing. But if you could “standardize” that value, you could talk about any variance (i.e.,
deviation) in equivalent terms.
• The standard deviation is simply the square root of the variance.
Standard deviation
• The process of standardizing deviations goes like this:
1. Start with the scores (in their meaningful units)
2. Compute the mean
3. Compute each score’s deviation from the mean
4. Square each deviation
5. Sum all the squared deviations (the Sum of Squares)
6. Divide by n (if population) or n-1 (if sample)
7. Take the square root – now the value is back in the units we started with!
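A minimal sketch of these seven steps in Python, using the "during" scores from the earlier example:

```python
import numpy as np

scores = np.array([2, 8, 7, 2, 8, 1, 9, 3, 9, 1])  # step 1: the scores, in meaningful units
mean = scores.mean()                                # step 2: the mean (= 5.0)
deviations = scores - mean                          # step 3: each score's deviation from the mean
squared = deviations ** 2                           # step 4: square each deviation
sum_of_squares = squared.sum()                      # step 5: Sum of Squares (= 108.0)
variance = sum_of_squares / len(scores)             # step 6: divide by n (population); use n-1 for a sample
sd = np.sqrt(variance)                              # step 7: square root -> back in the original units
print(variance, sd)                                 # 10.8  ~3.29
```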
Interpreting standard deviation (SD)
• First, the SD will let you know about the distribution of scores around the mean.
• High SDs (relative to the mean) indicate the scores are spread out
• Low SDs tell you that most scores are very near the mean.
High SD Low SD
Interpreting standard deviation (SD)
• Second, you can then interpret any individual score in terms of the SD.
• For example:
• mean = 50, SD = 10
• versus mean = 50, SD = 1
• A score of 55 is:
• 0.5 Standard deviation units from the mean (not much)
• OR
• 5 standard deviation units from mean (a lot!)
Standardized scores (Z)
• Third, you can use SDs to create standardized scores
• Force the scores onto a normal distribution by putting each score into units of SD.
• Subtract the mean from each score and divide by SD:
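In symbols:

$$z = \frac{x - \bar{x}}{s}$$

For example, with a mean of 50 and an SD of 10, a score of 55 gives z = (55 − 50) / 10 = 0.5, matching the interpretation on the previous slide.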
Standardized normal distribution
• ALL Z-scores have a mean of 0 and an SD of 1. Nice and simple.
• From this we can get the proportion of scores anywhere in the distribution.
The trouble with normal
• We violate assumptions about statistical tests if the distributions of our variables are not
approximately normal.
• Thus, we must first examine each variable’s distribution and adjust when necessary, so
that assumptions are met.
Following
• Examine every variable for:
  • Out of range values
  • Normality
  • Outliers
Checking data
• It is necessary to get a table of each variable with each value and its frequency of occurrence.
• The best way to examine categorical variables is by checking their frequencies, as in the sketch below.
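A minimal pandas sketch of such a frequency check (the column and its values are hypothetical; note how a typo and a missing value stand out immediately):

```python
import pandas as pd

# Hypothetical categorical column with a typo ("durng") and a missing value
df = pd.DataFrame({"condition": ["before", "during", "after", "before", "durng", None]})

# One row per distinct value with its frequency of occurrence
print(df["condition"].value_counts(dropna=False))
```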
Visual display of univariate data

Subject  before  during  after
1        3.1     2.3     7
2        3.2     8.8     4.2
3        2.8     7.1     3.2
4        3.3     2.3     6.7
5        3.3     8.6     4.5
6        3.3     1.5     6.6
7        2.8     9.1     3.4
8        3       3.3     6.5
9        3.1     9.5     4.1
10       3       1       7.3

• Now the example data from before has decimals.
• What kind of data is that?
• Precision has increased.
Visual display of univariate data
(same data as on the previous slide)
• Histograms
• Stem and leaf plots
• Boxplots
• QQ plots
• …and many, many more.
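A minimal sketch of three of these displays for the "during" column above (matplotlib and scipy):

```python
import matplotlib.pyplot as plt
from scipy import stats

during = [2.3, 8.8, 7.1, 2.3, 8.6, 1.5, 9.1, 3.3, 9.5, 1.0]   # the "during" column

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

axes[0].hist(during, bins=5)                        # histogram
axes[0].set_title("Histogram")

axes[1].boxplot(during)                             # box plot (5-number summary + outlier rules)
axes[1].set_title("Boxplot")

stats.probplot(during, dist="norm", plot=axes[2])   # QQ plot against a normal distribution
plt.show()
```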
So… what do you do?
• If you find a mistake, fix it.
• If you find an outlier, trim it or delete it.
• If your distributions are skewed, transform the data.
Dealing with Outliers
• First, try to explain it.
• In a normal distribution 0.4% are outliers (>2.7 SD) and 1 in a million is an extreme outlier
(>4.72 SD).
• For analyses you can:
• Delete the value – crude but effective
• Change the outlier to value ~3 SD from mean
• “Winsorize” it (make = to next highest value)
• “Trim” the mean – recalculate mean from data within interquartile range
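A minimal sketch of winsorizing and trimming with scipy (the sample values are made up; scipy's trim_mean cuts a fixed proportion from each tail, a close relative of the interquartile-range version described above):

```python
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

x = np.array([4.1, 4.8, 5.0, 5.2, 5.5, 5.9, 6.1, 42.0])   # 42.0 is a suspicious outlier

# Winsorize: replace the most extreme 12.5% on each tail with the next value inward
print(winsorize(x, limits=[0.125, 0.125]))

# Trimmed mean: recompute the mean after discarding 12.5% from each tail
print(stats.trim_mean(x, proportiontocut=0.125))

# Or cap values at roughly 3 SD above the mean of the non-outlying data
print(np.clip(x, None, x[:-1].mean() + 3 * x[:-1].std()))
```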
Scales of Graphs
• It is very important to pay attention to the scale that you are using when you are plotting.
• Compare the following graphs, created from identical data.
[Figure: the same mean scores for before, during, after, and follow-up plotted three times with different y-axis ranges (roughly -2 to 18, -20 to 30, and 2 to 3); the identical data look very different depending on the scale.]
Steps in Data Exploration and Processing
1. Identification of variables and data types
2. Analyzing the basic metrics
3. Non-Graphical Univariate Analysis
4. Graphical Univariate Analysis
5. Bivariate Analysis
6. Variable transformations
7. Missing value treatment
8. Outlier treatment
9. Correlation Analysis
10. Dimensionality Reduction
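A minimal pandas/scikit-learn skeleton touching several of these steps (the DataFrame is randomly generated; this is a sketch, not a complete pipeline):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Randomly generated stand-in dataset
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])

print(df.dtypes)           # 1. identify variables and data types
print(df.describe())       # 2-3. basic metrics / non-graphical univariate analysis
df.hist(bins=15)           # 4. graphical univariate analysis
print(df.corr())           # 5, 9. bivariate / correlation analysis
print(df.isnull().sum())   # 7. missing value treatment (here: just counting missing values)

z = (df - df.mean()) / df.std()                     # 6. a possible variable transformation (z-scores)
print(PCA(n_components=2).fit_transform(z).shape)   # 10. dimensionality reduction
```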
Summary
1. Examine all your variables thoroughly and carefully before you begin analysis.
2. Use visual displays whenever possible.
3. Transform each variable as necessary to deal with mistakes, outliers, and distributions.
Assignment
• Work through the example described in the article “Exploratory data analysis in Python”, by Tanu N. Prabhu:
• https://towardsdatascience.com/exploratory-data-analysis-in-python-c9a77dfa39ce
• Document the completed exercise on GitHub.