Exploratory Data Analysis

Exploratory Data Analysis (EDA) involves analyzing datasets using numerical methods and graphical tools to uncover patterns, trends, and anomalies. The aim of EDA is to maximize insight into data, detect outliers, and develop valid models. EDA methods can be classified into graphical and non-graphical, as well as univariate and multivariate approaches, with a focus on summarizing data through tables and plots.

Uploaded by

Mahim Jain Anwa

EXPLORATORY DATA ANALYSIS (EDA)

WHAT IS EDA?
• The analysis of datasets based on various numerical methods and graphical tools.
• Exploring data for patterns, trends, underlying structure, deviations from the trend, anomalies and strange
structures.
• It facilitates discovering the unexpected as well as confirming the expected.
• Another definition: An approach/philosophy for data analysis that employs a variety of techniques (mostly
graphical).

AIM OF THE EDA
• Maximize insight into a dataset
• Uncover underlying structure
• Extract important variables
• Detect outliers and anomalies
• Test underlying assumptions
• Develop valid models
• Determine optimal factor settings (Xs)

Classification of EDA*

• Exploratory data analysis is generally cross-classified in two ways. First, each method is either non-graphical
or graphical. And second, each method is either univariate or multivariate (usually just bivariate).
• Non-graphical methods generally involve calculation of summary statistics, while graphical methods
obviously summarize the data in a diagrammatic or pictorial way.
• Univariate methods look at one variable (data column) at a time, while multivariate methods look at two or
more variables at a time to explore relationships. Usually our multivariate EDA will be bivariate (looking at
exactly two variables), but occasionally it will involve three or more variables.
• It is almost always a good idea to perform univariate EDA on each of the components of a multivariate EDA
before performing the multivariate EDA.

*Seltman, H.J. (2015). Experimental Design and Analysis. http://www.stat.cmu.edu/~hseltman/309/Book/Book.pdf


EDA: Summarizing Data With Tables and Plots

Examine the entire data set using basic techniques before starting a formal statistical analysis.

• Familiarize yourself with the data.
• Find possible errors and anomalies.
• Examine the distribution of values for each variable.

Summarizing Variables
• Categorical variables
Frequency tables - how many observations in each category?
Relative frequency table - percent in each category.
Bar chart and other plots.
• Continuous variables
Bin the observations (create categories, e.g., (0-10), (11-20), etc.), then treat as ordered categorical.
Plots specific to continuous variables.

The goal for both categorical and continuous data is data reduction while preserving/extracting key information
about the process under investigation.

Categorical Data Summaries

Tables

Cancer site is a variable taking 5 values


• categorical or continuous?
• ordered or unordered?

Frequency Table

• Frequency Table: Categories with counts


• Relative Frequency Table: Percentage in each category

Graphing a Frequency Table - Bar Chart:
Plot the number of observations in each category:
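
The frequency table, relative frequency table, and bar chart described above can be sketched in base R. The slide's cancer-site table is not reproduced in this text, so the category labels and counts below are hypothetical placeholders, chosen only so that the variable takes 5 values.

```r
# Hypothetical data: illustrative labels only, not the slide's actual table.
site <- c("lung", "breast", "colon", "lung", "prostate",
          "breast", "lung", "skin", "colon", "lung")

freq <- table(site)                  # frequency table: count per category
rel_freq <- 100 * prop.table(freq)   # relative frequency table: percent per category

print(freq)
print(round(rel_freq, 1))

# Bar chart: one bar per category, height = number of observations
barplot(freq, main = "Observations per category",
        xlab = "Cancer site", ylab = "Frequency")
```

Replacing `freq` with `rel_freq` in the `barplot()` call would plot percentages instead of counts.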

Continuous Data - Tables
Example: Ages of 10 adult leukemia patients:
35; 40; 52; 27; 31; 42; 43; 28; 50; 35
One option is to group these ages into decades and create a categorical
age variable:

We can then create a frequency table for this new categorical age
variable.
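
One possible way to do this grouping in R is with `cut()` followed by `table()`. This is a minimal sketch: the decade boundaries (20 to 60) and the bin labels are assumptions chosen to fit the ten ages listed above.

```r
ages <- c(35, 40, 52, 27, 31, 42, 43, 28, 50, 35)

# right = FALSE makes each bin closed on the left: [20,30), [30,40), ...
age_group <- cut(ages,
                 breaks = seq(20, 60, by = 10),
                 right = FALSE,
                 labels = c("20-29", "30-39", "40-49", "50-59"))

age_freq <- table(age_group)         # frequency table of the binned variable
print(age_freq)
print(100 * prop.table(age_freq))    # relative frequencies in percent
```

Because `cut()` returns an ordered set of factor levels, the resulting table keeps the bins in age order rather than alphabetical order.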

Continuous data - plots
A histogram is a bar chart constructed using the frequencies or relative
frequencies of a grouped (or "binned") continuous variable.

It discards some information (the exact values), retaining only the
frequencies in each "bin".

Age histogram of 10 adult leukemia patients
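
The original figure is not available in this text. A histogram of the same ten ages could be drawn roughly as follows; the decade-wide bins, title, and axis labels are assumptions, not necessarily those used on the slide.

```r
ages <- c(35, 40, 52, 27, 31, 42, 43, 28, 50, 35)

# Decade-wide bins, closed on the left: [20,30), [30,40), [40,50), [50,60]
h <- hist(ages,
          breaks = seq(20, 60, by = 10),
          right = FALSE,
          main = "Ages of 10 adult leukemia patients",
          xlab = "Age (years)",
          ylab = "Frequency")

print(h$counts)   # the per-bin frequencies that the bars display
```

`hist()` returns the bin counts invisibly, so the plotted bar heights can be checked against a frequency table of the binned ages.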

Plotting Functions
R has several distinct plotting systems
Base R functions
• hist()
• barplot()
• boxplot()
• plot()
lattice package
ggplot2 package

Boxplot
> boxplot(mtcars$mpg, main = "Miles per Gallon")
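
To see the numbers behind the boxplot, one option is `boxplot.stats()`, which returns the five values the box and whiskers are drawn from. This is only a sketch of how the plot could be inspected; the variable name `bp` is an arbitrary choice.

```r
bp <- boxplot.stats(mtcars$mpg)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
print(bp$stats)

# Observations beyond the whiskers (drawn as individual points), if any
print(bp$out)
```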

The mean is the sum of the data values divided by the number of data items.

Helpful Hint: The mean is sometimes called the average.

The median is the middle value of an odd number of data items arranged
in order. For an even number of data items, the median is the average of
the two middle values.

The mode is the value or values that occur most often. When all of the
data values occur the same number of times, there is no mode.

The range is the difference between the greatest and least values. It is
used to show the spread of the data in a data set.

Additional Example: Finding the Mean, Median, Mode, and Range of Data

Find the mean, median, mode, and range of the data set.
4, 7, 8, 2, 1, 2, 4, 2

mean:

4 + 7 + 8 + 2 + 1 + 2 + 4 + 2 = 30    Add the values (8 items).

30 ÷ 8 = 3.75    Divide the sum by the number of items.

The mean is 3.75.



median:

1, 2, 2, 2, 4, 4, 7, 8 Arrange the values in order.

2 + 4 = 6    There are two middle values, so find the mean of these two values.

6 ÷ 2 = 3

The median is 3.

mode:

1, 2, 2, 2, 4, 4, 7, 8 The value 2 occurs three times.

The mode is 2.

range:

1, 2, 2, 2, 4, 4, 7, 8    Subtract the least value from the greatest value.

8 - 1 = 7

The range is 7.
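
The four results in this example can be checked in R. Base R has no built-in function for the statistical mode, so `stat_mode()` below is a hypothetical helper written for this sketch. Note also that R's `range()` returns the minimum and maximum rather than their difference.

```r
x <- c(4, 7, 8, 2, 1, 2, 4, 2)

# Hypothetical helper: returns the most frequent value(s), or an empty
# vector when several distinct values all occur equally often (no mode).
stat_mode <- function(v) {
  counts <- table(v)
  n <- as.vector(counts)
  if (length(n) > 1 && all(n == n[1])) {
    return(numeric(0))
  }
  as.numeric(names(counts)[n == max(n)])
}

x_mean   <- mean(x)            # sum of values / number of values
x_median <- median(x)          # average of the two middle values here
x_mode   <- stat_mode(x)       # most frequent value(s)
x_range  <- max(x) - min(x)    # greatest value minus least value

print(c(mean = x_mean, median = x_median, mode = x_mode, range = x_range))
```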
▪ The mean and median are measures of central tendency used to
represent the “middle” of a data set.
▪ To decide which measure is most appropriate for describing a set of
data, think about what each measure tells you about the data.
▪ The measure that you choose may depend on how the information in
the data set is being used.
With the Outlier

55, 88, 89, 90, 94    (outlier: 55)

mean: 55 + 88 + 89 + 90 + 94 = 416;  416 ÷ 5 = 83.2
median: 55, 88, 89, 90, 94;  the middle value is 89
mode: no value occurs more than once

The mean is 83.2. The median is 89. There is no mode.

Without the Outlier

88, 89, 90, 94    (55 removed)

mean: 88 + 89 + 90 + 94 = 361;  361 ÷ 4 = 90.25
median: 88, 89, 90, 94;  (89 + 90) ÷ 2 = 89.5
mode: no value occurs more than once

The mean is 90.25. The median is 89.5. There is no mode.

Additional Example 3 Continued

          Without the Outlier    With the Outlier
mean      90.25                  83.2
median    89.5                   89
mode      no mode                no mode

Adding the outlier decreased the mean by 7.05 and the median by 0.5.

The mode did not change.
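
The comparison above can be reproduced with a short R sketch. The variable names are arbitrary, and the outlier is dropped by value only because this small data set contains a single 55.

```r
scores <- c(55, 88, 89, 90, 94)
no_outlier <- scores[scores != 55]   # remove the outlier by value

with_out    <- c(mean = mean(scores),     median = median(scores))
without_out <- c(mean = mean(no_outlier), median = median(no_outlier))

print(with_out)
print(without_out)

# How far each measure moves when the outlier is included
print(without_out - with_out)
```

The mean moves much more than the median, which illustrates why the median is often preferred as a summary when outliers are present.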

