Exploratory Data Analysis

Exploratory Data Analysis (EDA) involves analyzing datasets using numerical methods and graphical tools to uncover patterns, trends, and anomalies. The aim of EDA is to maximize insight into data, detect outliers, and develop valid models. EDA methods can be classified into graphical and non-graphical, as well as univariate and multivariate approaches, with a focus on summarizing data through tables and plots.

Uploaded by

Mahim Jain Anwa

EXPLORATORY DATA ANALYSIS

(EDA)

WHAT IS EDA?
• The analysis of datasets based on various numerical methods and graphical tools.
• Exploring data for patterns, trends, underlying structure, deviations from the trend, anomalies and strange
structures.
• It facilitates discovering the unexpected as well as confirming the expected.
• Another definition: An approach/philosophy for data analysis that employs a variety of techniques (mostly
graphical).

AIM OF THE EDA
• Maximize insight into a dataset
• Uncover underlying structure
• Extract important variables
• Detect outliers and anomalies
• Test underlying assumptions
• Develop valid models
• Determine optimal factor settings (Xs)

Classification of EDA*

• Exploratory data analysis is generally cross-classified in two ways. First, each method is either non-graphical
or graphical. And second, each method is either univariate or multivariate (usually just bivariate).
• Non-graphical methods generally involve calculation of summary statistics, while graphical methods
obviously summarize the data in a diagrammatic or pictorial way.
• Univariate methods look at one variable (data column) at a time, while multivariate methods look at two or
more variables at a time to explore relationships. Usually our multivariate EDA will be bivariate (looking at
exactly two variables), but occasionally it will involve three or more variables.
• It is almost always a good idea to perform univariate EDA on each of the components of a multivariate EDA
before performing the multivariate EDA.

*Seltman, H.J. (2015). Experimental Design and Analysis. http://www.stat.cmu.edu/~hseltman/309/Book/Book.pdf

EDA: Summarizing Data With Tables and Plots

Examine the entire data set using basic techniques before starting a formal statistical analysis.

• Familiarizing yourself with the data.

• Find possible errors and anomalies.
• Examine the distribution of values for each variable.

Summarizing Variables
• Categorical variables
Frequency tables - how many observations in each category?
Relative frequency table - percent in each category.
Bar chart and other plots.
• Continuous variables
Bin the observations (create categories, e.g., (0-10), (11-20), etc.), then treat as ordered categorical.
Plots specific to Continuous variables.

The goal for both categorical and continuous data is data reduction while preserving/extracting key information
about the process under investigation.

Categorical Data Summaries

Tables

Cancer site is a variable taking 5 values

• categorical or continuous?
• ordered or unordered?

Frequency Table

• Frequency Table: Categories with counts

• Relative Frequency Table: Percentage in each category
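
A frequency table and its relative-frequency counterpart can be sketched in a few lines of Python. The observations below are hypothetical (the slide's cancer-site table is not reproduced here), and the function name `frequency_table` is an illustrative choice, not something from the slides.

```python
from collections import Counter

# Hypothetical observations of a 5-level categorical variable (not the slide's data).
sites = ["lung", "breast", "colon", "lung", "prostate",
         "breast", "lung", "skin", "colon", "lung"]

def frequency_table(observations):
    """Return {category: (count, percent)} ordered by descending count."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {cat: (n, 100 * n / total) for cat, n in counts.most_common()}

table = frequency_table(sites)
for category, (count, percent) in table.items():
    # One row per category: name, count, percent of all observations.
    print(f"{category:<10}{count:>3}{percent:>7.1f}%")
```

The counts column is the frequency table; the percent column is the relative frequency table.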

Graphing a Frequency Table - Bar Chart:
Plot the number of observations in each category:

Continuous Data - Tables
Example: Ages of 10 adult leukemia patients:
35; 40; 52; 27; 31; 42; 43; 28; 50; 35
One option is to group these ages into decades and create a categorical
age variable:

We can then create a frequency table for this new categorical age
variable.
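
As a minimal Python sketch of this step, the ten ages can be mapped to decade labels and counted. The exact bin edges (20-29, 30-39, and so on) and the helper name `decade_label` are assumptions, since the slide's own table is not reproduced here.

```python
from collections import Counter

ages = [35, 40, 52, 27, 31, 42, 43, 28, 50, 35]  # the ten ages from the example

def decade_label(age):
    """Map an age to its decade, e.g. 35 -> '30-39' (assumed bin edges)."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

# Count how many patients fall in each decade.
age_group_counts = Counter(decade_label(a) for a in ages)
for group in sorted(age_group_counts):
    print(f"{group}: {age_group_counts[group]}")
```

With these bin edges the counts come out as 2, 3, 3 and 2 for the twenties through the fifties.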

Continuous data - plots
A histogram is a bar chart constructed using the frequencies or relative
frequencies of a grouped (or "binned") continuous variable.

It discards some information (the exact values), retaining only the
frequencies in each "bin".
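
The loss of information depends on the bin width. The sketch below, in Python, counts the ten leukemia-patient ages from the earlier example with two different widths and prints a text histogram for each. The function name `histogram` and the two widths are illustrative assumptions.

```python
from collections import Counter

ages = [35, 40, 52, 27, 31, 42, 43, 28, 50, 35]

def histogram(values, bin_width):
    """Count how many values fall in each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

for width in (10, 5):
    print(f"bin width {width}")
    # Only non-empty bins are listed; integer ages are assumed for the labels.
    for start, count in histogram(ages, width).items():
        print(f"  {start:>2}-{start + width - 1:<2} {'#' * count}")
```

Narrower bins keep more detail about where the values lie, while wider bins give a smoother but coarser picture.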

Age histogram of 10 adult leukemia patients

Plotting Functions
R has several distinct plotting systems
Base R functions
• hist()
• barplot()
• boxplot()
• plot()
lattice package
ggplot2 package

Boxplot
> boxplot(mtcars$mpg, main = "Miles per Gallon")
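
A box plot is a picture of a five-number summary: minimum, lower quartile, median, upper quartile and maximum. The Python sketch below computes such a summary for the ten ages used earlier in the deck (the `mtcars` data is not available outside R). It assumes the Tukey-hinge convention for the quartiles; other quartile conventions can give slightly different box edges, and whisker or outlier rules are not modelled. The function name `five_number_summary` is an illustrative choice.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, lower hinge, median, upper hinge, max) using Tukey hinges."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2  # for odd n the median belongs to both halves
    lower, upper = data[:half], data[n - half:]
    return (data[0], median(lower), median(data), median(upper), data[-1])

ages = [35, 40, 52, 27, 31, 42, 43, 28, 50, 35]
print(five_number_summary(ages))
```

The box would span the two hinges, with a line at the median.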

The mean is the sum of the data values divided by the number of data items.

Helpful Hint: The mean is sometimes called the average.

The median is the middle value of an odd number of data items arranged
in order. For an even number of data items, the median is the average of
the two middle values.

The mode is the value or values that occur most often. When all of the
data values occur the same number of times, there is no mode.

The range is the difference between the greatest and least values. It is
used to show the spread of the data in a data set.

Additional Example: Finding the Mean, Median, Mode, and Range of Data

Find the mean, median, mode, and range of the data set.
4, 7, 8, 2, 1, 2, 4, 2

mean:

4 + 7 + 8 + 2 + 1 + 2 + 4 + 2 = 30    Add the values.

30 ÷ 8 = 3.75    Divide the sum by the number of items (8).

The mean is 3.75.

median:

1, 2, 2, 2, 4, 4, 7, 8    Arrange the values in order.

2 + 4 = 6    There are two middle values, so find the mean of these two values.

6 ÷ 2 = 3

The median is 3.

mode:

1, 2, 2, 2, 4, 4, 7, 8 The value 2 occurs three times.

The mode is 2.
range:

1, 2, 2, 2, 4, 4, 7, 8    Subtract the least value from the greatest value.

8 - 1 = 7

The range is 7.
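
The four results of this worked example can be checked with a short Python sketch. The helpers `modes` and `value_range` are hypothetical names written for this illustration; `modes` follows the slides' convention that there is no mode when every value occurs equally often.

```python
from collections import Counter
from statistics import mean, median

def modes(values):
    """Most frequent value(s); empty list when every value occurs equally often."""
    counts = Counter(values)
    if len(counts) > 1 and len(set(counts.values())) == 1:
        return []  # "no mode" convention from the slides
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

def value_range(values):
    """Difference between the greatest and least values."""
    return max(values) - min(values)

data = [4, 7, 8, 2, 1, 2, 4, 2]
print(mean(data), median(data), modes(data), value_range(data))
```

This prints a mean of 3.75, a median of 3.0, the single mode 2 and a range of 7, matching the hand calculation.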
▪ The mean and median are measures of central tendency used to
represent the “middle” of a data set.
▪ To decide which measure is most appropriate for describing a set of
data, think about what each measure tells you about the data.
▪ The measure that you choose may depend on how the information in
the data set is being used.
With the Outlier

55, 88, 89, 90, 94

outlier: 55

mean: 55 + 88 + 89 + 90 + 94 = 416; 416 ÷ 5 = 83.2
median: 55, 88, 89, 90, 94; the middle value is 89
mode: each value occurs once

The mean is 83.2. The median is 89. There is no mode.

Without the Outlier

88, 89, 90, 94 (55 removed)

mean: 88 + 89 + 90 + 94 = 361; 361 ÷ 4 = 90.25
median: 88, 89, 90, 94; (89 + 90) ÷ 2 = 89.5
mode: each value occurs once

The mean is 90.25. The median is 89.5. There is no mode.
Additional Example 3 Continued

          Without the Outlier   With the Outlier
mean      90.25                 83.2
median    89.5                  89
mode      no mode               no mode

Adding the outlier decreased the mean by 7.05 and the median by 0.5.

The mode did not change.
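
The same comparison can be reproduced in Python to see how much more the mean moves than the median when the outlier is included. This is a minimal sketch; the variable names are illustrative, and the shifts are rounded to two decimals to avoid floating-point noise.

```python
from statistics import mean, median

scores = [55, 88, 89, 90, 94]
outlier = 55
without = [s for s in scores if s != outlier]  # data set with the outlier removed

# How far each measure drops once the outlier is added back in.
mean_shift = round(mean(without) - mean(scores), 2)
median_shift = round(median(without) - median(scores), 2)

print(f"mean:   {mean(without)} -> {mean(scores)} (drops by {mean_shift})")
print(f"median: {median(without)} -> {median(scores)} (drops by {median_shift})")
```

The mean shifts by 7.05 while the median shifts by only 0.5, which is why the median is often preferred for describing data that contain outliers.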
