
Unit 3 Notes

Exploratory data analysis (EDA) refers to analyzing data sets to understand their key characteristics and relationships. The goals of EDA include data cleaning, descriptive statistics, data visualization, feature engineering, identifying correlations and relationships, data segmentation, hypothesis generation, and data quality assessment. There are different types of EDA, including univariate analysis of single variables, bivariate analysis of relationships between pairs of variables, multivariate analysis of interactions between multiple variables, and time series analysis of data with a temporal component. EDA techniques help explore data, discover patterns, and gain insights to inform further formal statistical analysis or modeling.

Uploaded by

patilamrutak2003
Copyright
© © All Rights Reserved

Typical data format and the types of EDA

Exploratory Data Analysis (EDA)


Exploratory Data Analysis (EDA) refers to the process of studying and exploring
data sets to understand their main characteristics, discover patterns, locate
outliers, and identify relationships between variables. EDA is normally carried
out as a preliminary step before undertaking more formal statistical analyses or
modeling.

The Main Goals of EDA

1. Data Cleaning: EDA involves examining the data for errors, missing
values, and inconsistencies. It includes techniques such as data imputation,
handling missing data, and identifying and removing outliers.
2. Descriptive Statistics: EDA uses summary statistics to understand the
central tendency, variability, and distribution of variables. Measures like
mean, median, mode, standard deviation, range, and percentiles are commonly
used.
3. Data Visualization: EDA employs visual techniques to represent the
data graphically. Visualizations such as histograms, box plots, scatter
plots, line plots, heatmaps, and bar charts assist in identifying patterns, trends,
and relationships within the data.
4. Feature Engineering: EDA allows for the exploration of variables
and their transformations to create new features or derive meaningful insights.
Feature engineering can involve scaling, normalization, binning, encoding
categorical variables, and creating interaction or derived variables.
5. Correlation and Relationships: EDA helps discover relationships and
dependencies between variables. Techniques such as correlation analysis,
scatter plots, and cross-tabulations offer insights into the strength and direction of
relationships between variables.
6. Data Segmentation: EDA can involve dividing the data into
meaningful segments based on certain criteria or characteristics. This
segmentation helps gain insights into specific subgroups within the
data and can lead to more focused analysis.
7. Hypothesis Generation: EDA aids in generating hypotheses or research
questions based on the preliminary exploration of the data. It helps
form the foundation for further analysis and model building.
8. Data Quality Assessment: EDA allows for assessing the quality and reliability
of the data. It involves checking for data integrity, consistency, and
accuracy to ensure the data is suitable for analysis.
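
As a rough illustration of the first two goals, the sketch below imputes missing values and then computes the usual descriptive statistics with Python's standard `statistics` module. The `ages` column is made-up sample data, and median imputation is only one of several possible strategies, chosen here as an assumption.

```python
import statistics

# Hypothetical sample column: ages with two missing entries (None).
ages = [23, 25, None, 31, 25, 40, None, 25]

# Data cleaning: count the missing values, then impute them with the
# median of the observed values (an assumed strategy, not the only one).
observed = [a for a in ages if a is not None]
n_missing = len(ages) - len(observed)
fill_value = statistics.median(observed)
cleaned = [fill_value if a is None else a for a in ages]

# Descriptive statistics on the cleaned column.
summary = {
    "mean": statistics.mean(cleaned),
    "median": statistics.median(cleaned),
    "mode": statistics.mode(cleaned),
    "stdev": statistics.stdev(cleaned),        # sample standard deviation
    "range": max(cleaned) - min(cleaned),
    "quartiles": statistics.quantiles(cleaned, n=4),
}

print(f"missing values imputed: {n_missing}")
for name, value in summary.items():
    print(f"{name}: {value}")
```

Note that imputing with the median pulls the spread of the column downward, which is one reason the imputation choice itself should be recorded during EDA.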

Types of EDA

Depending on the number of columns analyzed at a time, EDA can be divided into
univariate, bivariate, and multivariate analysis. More broadly, various EDA
techniques can be employed depending on the nature of the data and the goals of
the analysis. Here are some common types of EDA:
1. Univariate Analysis: This type of analysis focuses on examining
individual variables in the data set. It involves summarizing and
visualizing a single variable at a time to understand its distribution, central
tendency, spread, and other relevant statistics. Techniques like histograms,
box plots, bar charts, and summary statistics are commonly used in univariate
analysis.
2. Bivariate Analysis: Bivariate analysis involves exploring the relationship
between two variables. It helps find associations, correlations, and dependencies
between pairs of variables. Scatter plots, line plots, correlation matrices, and
cross-tabulation are commonly used techniques in bivariate analysis.
3. Multivariate Analysis: Multivariate analysis extends bivariate analysis to
encompass more than two variables. It aims to understand the complex
interactions and dependencies among multiple variables in a data set.
Techniques such as heatmaps, parallel coordinates, factor analysis, and
principal component analysis (PCA) are used for multivariate analysis.
4. Time Series Analysis: This type of analysis is mainly applied to data
sets that have a temporal component. Time series analysis entails
inspecting and modeling patterns, trends, and seasonality in the data
over time. Techniques like line plots, autocorrelation analysis,
moving averages, and ARIMA (AutoRegressive Integrated Moving Average)
models are commonly used in time series analysis.
5. Missing Data Analysis: Missing data is a common issue in
datasets, and it may impact the reliability and validity of the analysis. Missing
data analysis includes identifying missing values, understanding the patterns
of missingness, and using suitable techniques to deal with missing data.
Techniques such as missing data pattern analysis, imputation methods, and
sensitivity analysis are employed in missing data analysis.
6. Outlier Analysis: Outliers are data points that deviate significantly from
the general pattern of the data. Outlier analysis includes identifying and
understanding the presence of outliers, their potential causes, and their impact on
the analysis. Techniques such as box plots, scatter plots, z-scores, and
clustering algorithms are used for outlier analysis.
7. Data Visualization: Data visualization is a critical aspect of EDA that entails
creating visual representations of the data to facilitate understanding and
exploration. Various visualization techniques, such as bar charts,
histograms, scatter plots, line plots, heatmaps, and interactive dashboards, are
used to represent different kinds of data.
These are just a few examples of the types of EDA techniques that can be
employed during data analysis. The choice of techniques
depends on the data characteristics, research questions, and the insights
sought from the analysis.
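
To make the outlier analysis item more concrete, here is a minimal sketch of two common screening rules: a z-score cutoff and the 1.5 x IQR fence that box plots typically draw. The function names, the sample `data`, and the thresholds are illustrative assumptions; appropriate cutoffs depend on the data.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) / sd > threshold]

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical measurements with one suspicious value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 95]

# In small samples a single extreme point inflates the standard deviation,
# so a lower z-score threshold than the default is used here (an assumption).
print("z-score rule:", zscore_outliers(data, threshold=2.5))
print("IQR rule:    ", iqr_outliers(data))
```

The IQR rule is based on quartiles, so it is less affected by the outlier itself than the z-score rule, whose mean and standard deviation are both pulled toward the extreme value.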

OBJECTIVES OF EXPLORATORY DATA ANALYSIS


The objectives of exploratory data analysis include, but are not limited to:

1. identifying data outliers,
2. identifying trends in time and space,
3. detecting patterns of interest,
4. generating hypotheses,
5. opening opportunities for new ways to collect data, and
6. enabling hypothesis testing through experiments.

TYPES OF EXPLORATORY DATA ANALYSIS:


1. Univariate Non-graphical
2. Multivariate Non-graphical
3. Univariate graphical
4. Multivariate graphical
1. Univariate Non-graphical: This is the simplest form of data analysis, as
it uses just one variable to study the data. The standard goal of
univariate non-graphical EDA is to understand the underlying sample
distribution and make observations about the population. Outlier detection is
also part of the analysis. The characteristics of the population distribution
include:
• Central tendency: The central tendency or location of a distribution has
to do with typical or middle values. The commonly useful measures of
central tendency are the statistics called mean, median, and sometimes mode,
of which the most common is the mean. For skewed distributions or when
there is concern about outliers, the median may be preferred.
• Spread: Spread is an indicator of how far from the center
we are likely to find the data values. The standard deviation and variance
are two useful measures of spread. The variance is the mean of the
squares of the individual deviations, and the standard deviation is the
square root of the variance.
• Skewness and kurtosis: Two more useful univariate descriptors are the
skewness and kurtosis of the distribution. Skewness is a measure of
asymmetry, and kurtosis is a more subtle measure of peakedness
compared to a normal distribution.
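
The spread and shape measures above can be written out directly from their definitions. The sketch below is one way to do so, under stated assumptions: the variance divides by n - 1 (the usual sample convention), while skewness and excess kurtosis use the simple moment-based formulas m3 / m2^1.5 and m4 / m2^2 - 3. Other textbooks and libraries apply different small-sample corrections, and the function name and sample values are hypothetical.

```python
import math

def univariate_summary(values):
    """Spread and shape of one variable, computed from first principles."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]

    # Sample variance: squared deviations averaged over n - 1 (assumption).
    variance = sum(d ** 2 for d in deviations) / (n - 1)
    std_dev = math.sqrt(variance)  # standard deviation = root of the variance

    # Central moments (divided by n) used for the shape measures.
    m2 = sum(d ** 2 for d in deviations) / n
    m3 = sum(d ** 3 for d in deviations) / n
    m4 = sum(d ** 4 for d in deviations) / n

    return {
        "mean": mean,
        "variance": variance,
        "std_dev": std_dev,
        "skewness": m3 / m2 ** 1.5,          # 0 for a symmetric sample
        "excess_kurtosis": m4 / m2 ** 2 - 3,  # 0 for a normal distribution
    }

# Hypothetical sample with a longer right tail.
stats = univariate_summary([2, 4, 4, 4, 5, 5, 7, 9])
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

A positive skewness, as in this sample, indicates a longer right tail, which is the situation in which the median may be preferred over the mean.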
2. Multivariate Non-graphical: Multivariate non-graphical EDA techniques are
generally used to show the relationship between two or more variables in the
form of either cross-tabulation or statistics.
• For categorical data, an extension of tabulation called cross-tabulation is
extremely useful. For two variables, cross-tabulation is performed by making a
two-way table with column headings that match the levels of one variable
and row headings that match the levels of the other variable, then
filling in the counts of all subjects that share the same pair of levels.
• For one categorical variable and one quantitative variable, we compute
statistics for the quantitative variable separately for each level of the
categorical variable, then compare the statistics across the levels of the
categorical variable.
• Comparing the means is an informal version of one-way ANOVA, and comparing
medians is a robust, informal version of one-way ANOVA.
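
Both ideas can be sketched in a few lines of plain Python: a two-way table of counts for two categorical variables, and per-level means and medians of a quantitative variable. The `records` data and the helper names `cross_tab` and `group_stats` are hypothetical and exist only for illustration.

```python
import statistics
from collections import Counter, defaultdict

# Hypothetical records: (gender, smoker, weight_kg).
records = [
    ("F", "yes", 61.0),
    ("F", "no", 58.5),
    ("M", "no", 80.0),
    ("M", "yes", 77.5),
    ("F", "no", 63.0),
    ("M", "no", 84.0),
]

def cross_tab(rows, row_idx, col_idx):
    """Count the subjects sharing each (row level, column level) pair."""
    return Counter((r[row_idx], r[col_idx]) for r in rows)

def group_stats(rows, group_idx, value_idx):
    """Mean and median of a quantitative column for each category level."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[group_idx]].append(r[value_idx])
    return {
        level: {"mean": statistics.mean(vals), "median": statistics.median(vals)}
        for level, vals in groups.items()
    }

table = cross_tab(records, row_idx=0, col_idx=1)
by_gender = group_stats(records, group_idx=0, value_idx=2)

# Print the two-way table with gender as rows and smoker as columns.
col_levels = ["yes", "no"]
print("gender", *col_levels)
for row_level in ["F", "M"]:
    print(row_level, *(table[(row_level, c)] for c in col_levels))
print(by_gender)
```

Comparing the `mean` or `median` entries across the levels in `by_gender` is the informal comparison described in the last bullet.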
3. Univariate graphical: Non-graphical methods are quantitative and objective,
but they cannot give a complete picture of the data. Graphical methods, which
involve a degree of subjective analysis, are therefore also required. Common
types of univariate graphics are:
• Histogram: The most basic graph is the histogram, which is a
barplot in which each bar represents the frequency (count) or proportion
(count/total count) of cases for a range of values. Histograms are one of the
simplest ways to quickly learn a lot about your data, including central
tendency, spread, modality, shape, and outliers.
• Stem-and-leaf plots: A simple substitute for a histogram is the
stem-and-leaf plot. It shows all data values and the shape of the distribution.
• Boxplots: Another very useful univariate graphical technique is the
boxplot. Boxplots are excellent at presenting information about central
tendency and show robust measures of location and spread, as well as
providing information about symmetry and outliers, although they can be
misleading about aspects like multimodality. One of the best uses
of boxplots is in the form of side-by-side boxplots.
• Quantile-normal plots: The final univariate graphical EDA technique is
the most intricate. It is called the quantile-normal or QN plot or, more
generally, the quantile-quantile or QQ plot. It is used to see how well a
particular sample follows a particular theoretical distribution. It allows
detection of non-normality and diagnosis of skewness and kurtosis.
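
The numbers behind two of these plots can be computed without a plotting library. The sketch below is illustrative only: `histogram_counts` bins values into equal-width ranges and prints a text histogram, and `qq_points` pairs each sorted observation with the matching quantile of a normal distribution fitted to the sample. The plotting position (i + 0.5) / n is one common convention assumed here, and the sample values are made up.

```python
from statistics import NormalDist, mean, stdev

def histogram_counts(values, bin_width, start):
    """Count the cases falling in each bin [left, left + bin_width)."""
    counts = {}
    for v in values:
        left = start + bin_width * int((v - start) // bin_width)
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

def qq_points(values):
    """Pair theoretical normal quantiles with the sorted sample values."""
    n = len(values)
    fitted = NormalDist(mean(values), stdev(values))
    ordered = sorted(values)
    # (i + 0.5) / n is an assumed plotting position; others are in use.
    return [(fitted.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(ordered)]

# Hypothetical sample with one large value in the right tail.
sample = [4.1, 4.8, 5.0, 5.2, 5.5, 5.9, 6.1, 6.4, 7.0, 9.8]

bins = histogram_counts(sample, bin_width=1.0, start=4.0)
for left, count in bins.items():
    print(f"[{left:.0f}, {left + 1:.0f}) {'#' * count}")

for theoretical, observed in qq_points(sample):
    print(f"{theoretical:6.2f}  {observed:5.2f}")
```

If the sample were close to normal, the two columns of the QQ output would be nearly equal; points that drift apart in the upper tail suggest right skew.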
4. Multivariate graphical: Multivariate graphical EDA uses graphics to display
relationships between two or more sets of data. For categorical data, the one
in common use is a grouped barplot, with each group representing one level of
one of the variables and each bar within a group representing a level of the
other variable.
Other common types of multivariate graphics are:
• Scatterplot: For two quantitative variables, the basic graphical EDA
technique is the scatterplot, which has one variable on the x-axis, one
on the y-axis, and a point for each case in your dataset.
• Run chart: A line graph of data plotted over time.
• Heat map: A graphical representation of data where values are depicted
by color.
• Multivariate chart: A graphical representation of the relationships
between factors and a response.
• Bubble chart: A data visualization that displays multiple circles (bubbles)
in a two-dimensional plot.
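
A heat map is often drawn from a correlation matrix of the quantitative variables. As a rough sketch, the code below computes Pearson correlations in plain Python and prints a coarse text rendering in which a symbol stands in for color. The `columns` data, the helper names, and the 0.5 cutoff in `shade` are all assumptions made for illustration; a real heat map would normally be drawn with a plotting library.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_matrix(cols):
    """Nested dict holding the correlation of every pair of columns."""
    names = list(cols)
    return {a: {b: pearson(cols[a], cols[b]) for b in names} for a in names}

def shade(r):
    """Map a correlation to a symbol, standing in for a heat map color."""
    if r >= 0.5:
        return "+"
    if r <= -0.5:
        return "-"
    return "."

# Hypothetical data set with three quantitative variables.
columns = {
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 70, 79],
    "hours_tv": [5, 4, 4, 2, 1],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    cells = " ".join(shade(r) for r in row.values())
    print(f"{name:>14} {cells}")
```

Each cell of the printed grid corresponds to one pair of variables, which is the same layout a color heat map of the matrix would use.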
