
UNIT-3

1. Data Analytics
Data analytics is the process of examining raw data with the purpose of drawing conclusions about that information.
It involves various techniques and methods to analyze, interpret, and derive insights from data. Here are some key
aspects of data analytics:

Data Collection: Data analytics begins with the collection of relevant data. This can involve gathering data from
various sources such as databases, spreadsheets, sensors, social media, etc. Ensuring data quality at this stage is
essential for an accurate and reliable analysis.

Data Cleaning and Preparation: Raw data often contains errors, missing values, and inconsistencies. Data cleaning
involves processes such as removing duplicates, filling missing values, standardizing formats, and correcting errors to
prepare the data for analysis.
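As a rough illustration, a minimal pandas sketch of these cleaning steps might look like the following. The column names, values, and fill rules here are invented purely for the example and are not part of the original text.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data containing duplicate rows, missing values and inconsistent text formats
raw = pd.DataFrame({
    "customer_id": [101, 102, 102, 103, 104],
    "city": ["Delhi ", "mumbai", "mumbai", "DELHI", None],
    "age": [34, np.nan, np.nan, 29, 41],
})

clean = (
    raw.drop_duplicates()                                     # remove duplicate rows
       .assign(
           city=lambda d: d["city"].str.strip().str.title(),  # standardize text formats
           age=lambda d: d["age"].fillna(d["age"].median()),  # fill missing numeric values
       )
)
print(clean)
```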

Exploratory Data Analysis (EDA): EDA involves examining the data to understand its characteristics, identify
patterns, and explore relationships between variables. Visualization techniques such as histograms, scatter plots, and
box plots are commonly used in EDA to gain insights into the data.

Descriptive Analytics: Descriptive analytics focuses on summarizing and describing the main features of the data.
This can include measures such as mean, median, mode, variance, standard deviation, and percentiles to describe
the central tendency and variability of the data.

Predictive Analytics: Predictive analytics involves using statistical and machine learning techniques to analyze
historical data and make predictions about future outcomes. This can include techniques such as regression analysis,
time series forecasting, classification, and clustering.
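For instance, a very small regression sketch with scikit-learn might look like this; the advertising-spend and sales figures are made-up illustration data, not taken from the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical data: advertising spend (feature) vs. monthly sales (target)
spend = np.array([[10], [15], [20], [25], [30]])
sales = np.array([110, 135, 162, 181, 210])

# Fit a linear model to the historical data and predict a future outcome
model = LinearRegression().fit(spend, sales)
print(model.predict(np.array([[40]])))   # predicted sales for a future spend of 40
```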

Prescriptive Analytics: Prescriptive analytics goes beyond predicting future outcomes to provide recommendations
on the actions to take to achieve desired outcomes. It involves optimization and simulation techniques to identify the
best course of action based on the analysis of available data.

Data Visualization: Data visualization is an important aspect of data analytics that involves presenting data in visual
formats such as charts, graphs, and dashboards. Visualization helps in communicating insights effectively and making
data-driven decisions.

Data Interpretation and Reporting: Finally, data analytics involves interpreting the results of analysis and
communicating insights to stakeholders. This often involves creating reports, presentations, and dashboards to
convey findings and recommendations.

2. Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) refers to the process of studying and exploring data sets to understand their
main characteristics, discover patterns, locate outliers, and identify relationships between variables. EDA is normally
carried out as a preliminary step before undertaking more formal statistical analyses or modeling.

The Foremost Goals of EDA

1. Data Cleaning: EDA involves examining the data for errors, missing values, and inconsistencies. It includes
techniques such as data imputation, handling missing data, and identifying and removing outliers.

2. Descriptive Statistics: EDA uses descriptive statistics to understand the central tendency, variability, and
distribution of variables. Measures like the mean, median, mode, standard deviation, range, and percentiles are
commonly used.

3. Data Visualization: EDA employs visual techniques to represent the data graphically. Visualizations such
as histograms, box plots, scatter plots, line plots, heatmaps, and bar charts help identify patterns, trends, and
relationships within the data.

4. Feature Engineering: EDA allows for the exploration of different variables and their transformations to create new
features or derive meaningful insights. Feature engineering can involve scaling, normalization, binning, encoding
categorical variables, and creating interaction or derived variables.

5. Correlation and Relationships: EDA helps discover relationships and dependencies between variables.
Techniques such as correlation analysis, scatter plots, and cross-tabulations offer insights into the strength and
direction of relationships between variables.

6. Data Segmentation: EDA can involve dividing the data into meaningful segments based on certain
criteria or characteristics. This segmentation helps gain insights into specific subgroups within the data and
can lead to more focused analysis.

7. Hypothesis Generation: EDA aids in generating hypotheses or research questions based on the preliminary
exploration of the data. It helps form the foundation for further analysis and model building.

8. Data Quality Assessment: EDA allows for assessing the quality and reliability of the data. It involves checking
for data integrity, consistency, and accuracy to make certain the data is suitable for analysis.

Types of EDA

Depending on the number of variables we are analyzing at a time, EDA can be divided into univariate, bivariate, and multivariate analysis.

More generally, there are various EDA techniques that can be employed depending on the nature of the data and the
goals of the analysis. Here are some common types of EDA:

1. Univariate Analysis: This type of analysis focuses on examining individual variables in the data set. It involves
summarizing and visualizing a single variable at a time to understand its distribution, central tendency, spread, and
other relevant statistics. Techniques like histograms, box plots, bar charts, and summary statistics are commonly
used in univariate analysis.

2. Bivariate Analysis: Bivariate analysis involves exploring the relationship between two variables. It helps find
associations, correlations, and dependencies between pairs of variables. Scatter plots, line plots, correlation
matrices, and cross-tabulation are commonly used techniques in bivariate analysis.

3. Multivariate Analysis: Multivariate analysis extends bivariate analysis to more than two variables. It
aims to understand the complex interactions and dependencies among multiple variables in a data set.
Techniques such as heatmaps, parallel coordinates, factor analysis, and principal component analysis (PCA) are
used for multivariate analysis.

4. Time Series Analysis: This type of analysis is applied to data sets that have a temporal component.
Time series analysis involves examining and modeling patterns, trends, and seasonality in the data over
time. Techniques like line plots, autocorrelation analysis, moving averages, and ARIMA (AutoRegressive
Integrated Moving Average) models are commonly used in time series analysis.

5. Missing Data Analysis: Missing data is a common issue in datasets, and it can affect the reliability and
validity of the analysis. Missing data analysis involves identifying missing values, understanding the patterns of
missingness, and using suitable techniques to deal with missing data. Techniques such as missing data patterns,
imputation strategies, and sensitivity analysis are employed in missing data analysis.

6. Outlier Analysis: Outliers are data points that significantly deviate from the general pattern of the data. Outlier
analysis involves identifying and understanding the presence of outliers, their potential causes, and their impact on the
analysis. Techniques such as box plots, scatter plots, z-scores, and clustering algorithms are used for outlier
analysis.

7. Data Visualization: Data visualization is a critical part of EDA that involves creating visual representations of the
data to facilitate understanding and exploration. Various visualization techniques, including bar charts,
histograms, scatter plots, line plots, heatmaps, and interactive dashboards, are used to represent different kinds of
data.

These are just a few examples of the types of EDA techniques that can be employed during data
analysis. The choice of technique depends on the characteristics of the data, the research questions, and the insights sought
from the analysis; a minimal sketch of univariate and bivariate exploration follows below.
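The sketch below uses pandas and matplotlib on synthetic data generated purely for illustration; the column names (age, income, spend) are assumptions, not taken from the text.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data standing in for a real dataset (purely illustrative)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(40, 12, 500).round(),
    "income": rng.normal(50_000, 15_000, 500).round(-2),
})
df["spend"] = 0.3 * df["income"] + rng.normal(0, 3_000, 500)

# Univariate analysis: summary statistics and a histogram of a single variable
print(df["age"].describe())
df["age"].plot.hist(bins=20, title="Distribution of age")

# Bivariate analysis: correlation and a scatter plot for a pair of variables
print(df[["income", "spend"]].corr())
df.plot.scatter(x="income", y="spend", title="Income vs. spend")
plt.show()
```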

3. Descriptive Statistics

In descriptive statistics, we describe our data with the help of various representative methods such as
charts, graphs, tables, Excel files, etc. We summarize the data and present it in a meaningful way so that it
can be easily understood. Most of the time it is performed on small data sets, and this analysis can also help us
anticipate future trends based on the current findings. Some measures that are used to describe a data set are
measures of central tendency and measures of variability or dispersion.

Types of Descriptive Statistics

• Measures of Central Tendency
• Measures of Variability
• Measures of Frequency Distribution
Measures of Central Tendency

It represents the whole set of data by a single value. It gives us the location of the central points. There are
three main measures of central tendency:

Mode

The mode is the most commonly occurring value in a distribution.

Consider this dataset showing the retirement age of 11 people, in whole years:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

This table shows a simple frequency distribution of the retirement age data.

Frequency distribution table

Age (years)    Frequency
54             3
55             1
56             1
57             2
58             2
60             2

The most commonly occurring value is 54, therefore the mode of this distribution is 54 years.

Advantage of the mode

The mode has an advantage over the median and the mean as it can be found for both numerical and
categorical (non-numerical) data.

Limitations of the mode

There are some limitations to using the mode. In some distributions, the mode may not reflect the centre of the
distribution very well. When the distribution of retirement age is ordered from lowest to highest value, it is
easy to see that the centre of the distribution is 57 years, but the mode is lower, at 54 years.

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

It is also possible for there to be more than one mode for the same distribution of data (bi-modal, or multi-
modal). The presence of more than one mode limits the ability of the mode to describe the centre or
typical value of the distribution, because a single value to describe the centre cannot be identified.

In some cases, particularly where the data are continuous, the distribution may have no mode at all (i.e. if all
values are different).

In cases such as these, it may be better to consider using the median or mean or group the data into
appropriate intervals and find the modal class.

Median

The median is the middle value in distribution when the values are arranged in ascending or descending
order.

The median divides the distribution in half (there are 50% of observations on either side of the median
value). In a distribution with an odd number of observations, the median value is the middle value.
Looking at the retirement age distribution (which has 11 observations), the median is the middle value,
which is 57 years:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

When the distribution has an even number of observations, the median value is the mean of the two middle
values. In the following distribution, the two middle values are 56 and 57, therefore the median equals 56.5
years:

52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

Advantage of the median

The median is less affected by outliers and skewed data than the mean and is usually the preferred measure
of central tendency when the distribution is not symmetrical.

Limitation of the median

The median cannot be identified for categorical nominal data, as it cannot be logically ordered.

Mean

The mean is the sum of the value of each observation in a dataset divided by the number of observations.
This is also known as the arithmetic average.

Looking at the retirement age distribution again:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

The mean is calculated by adding together all the values (54+54+54+55+56+57+57+58+58+60+60 = 623)
and dividing by the number of observations (11) which equals 56.6 years.
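These three measures can be checked directly with Python's standard statistics module, using the retirement-age data above:

```python
import statistics

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

print(statistics.mode(ages))             # 54   (most frequent value)
print(statistics.median(ages))           # 57   (middle of the 11 ordered values)
print(round(statistics.mean(ages), 1))   # 56.6 (623 / 11)

# With an even number of observations the median is the mean of the two middle values
print(statistics.median([52] + ages))    # 56.5
```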

Advantage of the mean

The mean can be used for both continuous and discrete numeric data.

Limitations of the mean

The mean cannot be calculated for categorical data, as the values cannot be summed.

As the mean includes every value in the distribution the mean is influenced by outliers and skewed
distributions.

Another thing about the mean


The population mean is indicated by the Greek symbol µ (pronounced ‘mu’). When the mean is calculated
on a distribution from a sample it is indicated by the symbol x̅ (pronounced X-bar).

Impact of shape of distribution on measures of central tendency

Symmetrical distributions

When a distribution is symmetrical, the mode, median and mean are all in the middle of the
distribution. The following graph shows a larger retirement age dataset with a distribution which is
symmetrical. The mode, median and mean all equal 58 years.

Retirement age: Symmetrical distribution

Skewed distributions

When a distribution is skewed the mode remains the most commonly occurring value, the median remains
the middle value in the distribution, but the mean is generally ‘pulled’ in the direction of the tails. In a
skewed distribution, the median is often a preferred measure of central tendency, as the mean is not usually
in the middle of the distribution.

A distribution is said to be positively or right skewed when the tail on the right side of the distribution is
longer than the left side. In a positively skewed distribution it is common for the mean to be ‘pulled’ toward
the right tail of the distribution. Although there are exceptions to this rule, generally, most of the values,
including the median value, tend to be less than the mean value.

The following graph shows a larger retirement age data set with a distribution which is right skewed. The
data has been grouped into classes, as the variable being measured (retirement age) is continuous. The mode
is 54 years, the modal class is 54-56 years, the median is 56 years, and the mean is 57.2 years.

Retirement age: Positive (right) skew


A distribution is said to be negatively or left skewed when the tail on the left side of the distribution is longer
than the right side. In a negatively skewed distribution, it is common for the mean to be ‘pulled’ toward the
left tail of the distribution. Although there are exceptions to this rule, generally, most of the values,
including the median value, tend to be greater than the mean value.

The following graph shows a larger retirement age dataset with a distribution which left skewed. The mode
is 65 years, the modal class is 63-65 years, the median is 63 years and the mean is 61.8 years.

Retirement age: Negative (left) skew

Outliers influence on measures of central tendency

Outliers are extreme, or atypical data value(s) that are notably different from the rest of the data.

It is important to detect outliers within a distribution, because they can alter the results of the data
analysis. The mean is more sensitive to the existence of outliers than the median or mode.
Consider the initial retirement age dataset again, with one difference: the last observation of 60 years has
been replaced with a retirement age of 81 years. This value is much higher than the other values and could
be considered an outlier. However, it has not changed the middle of the distribution, and therefore the
median value is still 57 years.

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 81

As all values are included in the calculation of the mean, the outlier will influence the mean value.

(54+54+54+55+56+57+57+58+58+60+81 = 644), divided by 11 = 58.5 years

In this distribution the outlier value has increased the mean value.

Despite the existence of outliers in a distribution, the mean can still be an appropriate measure of central
tendency, especially if the rest of the data is normally distributed. If the outlier is confirmed as a valid
extreme value, it should not be removed from the dataset. Several common regression techniques can help
reduce the influence of outliers on the mean value.

Standard Deviation
Standard deviation is the degree of dispersion or the scatter of the data points relative to its mean, in
descriptive statistics. It tells how the values are spread across the data sample and it is the measure of the
variation of the data points from the mean. The standard deviation of a data set, sample, statistical
population, random variable, or probability distribution is the square root of its variance.

When we have n observations x1, x2, ....., xn with mean x̄, the deviations of the values from the mean are
(xi − x̄). Because these deviations cancel each other out when added, we instead consider the sum of their
squares, Σ(xi − x̄)², taken over i = 1 to n. If the squared differences from the mean are small, the
observations xi are close to the mean x̄ and the degree of dispersion is low. If this sum is large, there is a
higher degree of dispersion of the observations from the mean x̄. Thus Σ(xi − x̄)² is a reasonable indicator of
the degree of dispersion or scatter.

We take (1/n) Σ(xi − x̄)² as a proper measure of dispersion, and this is called the
variance (σ²). The positive square root of the variance is the standard deviation.
Standard Deviation Formula

The spread of statistical data is measured by the standard deviation. The degree of dispersion is computed by
estimating the deviation of the data points from the mean. As discussed, the variance of a data set is the
average squared distance between the mean and each data value, and the standard deviation describes the
spread of the data values around the mean. Two standard deviation formulas are used: one for sample data
and one for a population.
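The two formulas are the standard ones; the sample formula is the one applied in Example 1 below.

Population standard deviation: σ = √[ Σ(xi − μ)² / N ], where μ is the population mean and N is the population size.

Sample standard deviation: s = √[ Σ(xi − x̄)² / (n − 1) ], where x̄ is the sample mean and n is the sample size.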

Example 1: There are 39 plants in the garden. A few plants were selected randomly and their heights in
cm were recorded as follows: 51, 38, 79, 46, 57. Calculate the standard deviation of their heights.

Solution:

n = 5

Sample mean x̄ = (51 + 38 + 79 + 46 + 57)/5 = 54.2

Since sample data is given, we use the sample SD formula.

SD = √[ Σ(xi − x̄)² / (n − 1) ]

= √[ ((51 − 54.2)² + (38 − 54.2)² + (79 − 54.2)² + (46 − 54.2)² + (57 − 54.2)²) / 4 ]

≈ 15.5

Answer: Standard deviation for this data is 15.5
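This result can be quickly verified with Python's statistics module, whose stdev function uses the same n − 1 (sample) formula:

```python
import statistics

heights = [51, 38, 79, 46, 57]

print(statistics.mean(heights))             # 54.2
print(round(statistics.stdev(heights), 1))  # 15.5 (sample formula with n - 1 in the denominator)
```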

Example 3: Find the standard deviation of X which has the probability distribution shown in the
table below.

X    P(X)
4    0.2
5    0.3
6    0.5

Solution:

To find the expected value of X, find the product X·P(X) for each row and sum these terms.

X    P(X)    X·P(X)
4    0.2     0.8
5    0.3     1.5
6    0.5     3.0

E(X) = 0.8 + 1.5 + 3.0 = 5.3

X    P(X)    (X − E(X))²    (X − E(X))²·P(X)
4    0.2     1.69           0.338
5    0.3     0.09           0.027
6    0.5     0.49           0.245

Variance = Σ (X − E(X))²·P(X) = 0.338 + 0.027 + 0.245 = 0.61

Standard Deviation = √0.61 ≈ 0.78

Answer: The standard deviation of the probability distribution is approximately 0.78
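The same calculation can be verified numerically in a few lines of Python:

```python
import math

x = [4, 5, 6]
p = [0.2, 0.3, 0.5]

mean = sum(xi * pi for xi, pi in zip(x, p))                     # E(X) = 5.3
variance = sum(pi * (xi - mean) ** 2 for xi, pi in zip(x, p))   # 0.61
print(round(mean, 2), round(variance, 2), round(math.sqrt(variance), 2))   # 5.3 0.61 0.78
```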


Box Plot

The idea of the box plot was presented by John Tukey in 1970, and he wrote about it in his book "Exploratory Data
Analysis" in 1977. The box plot is also known as a whisker plot, box-and-whisker plot, or simply a box-and-
whisker diagram. A box plot is a graphical representation of the distribution of a dataset. It displays key
summary statistics such as the median, quartiles, and potential outliers in a concise and visual manner. A
box plot provides a summary of the distribution, helps identify potential outliers, and allows different
datasets to be compared in a compact, visual way.

Elements of Box Plot

A box plot gives a five-number summary of a set of data, which is:

• Minimum – the minimum value in the dataset, excluding the outliers.

• First Quartile (Q1) – 25% of the data lies below the first (lower) quartile.

• Median (Q2) – the mid-point of the dataset; half of the values lie below it and half above.

• Third Quartile (Q3) – 75% of the data lies below the third (upper) quartile.

• Maximum – the maximum value in the dataset, excluding the outliers.

The area inside the box (50% of the data) is known as the Inter Quartile Range. The IQR is calculated as –

IQR = Q3-Q1

Outliers are the data points below the lower limit and above the upper limit. The lower and upper limits are
calculated as –

Lower Limit = Q1 - 1.5*IQR

Upper Limit = Q3 + 1.5*IQR


The values below and above these limits are considered outliers, and the minimum and maximum shown on the
plot are calculated from the points which lie within the lower and upper limits.

How to create a box plot?

Let us take a sample data to understand how to create a box plot.

Here are the runs scored by a cricket team in a league of 12 matches – 100, 120, 110, 150, 110, 140, 130,
170, 120, 220, 140, 110.

To draw a box plot for the given data first we need to arrange the data in ascending order and then find the
minimum, first quartile, median, third quartile and the maximum.

Ascending Order

100, 110, 110, 110, 120, 120, 130, 140, 140, 150, 170, 220

Median (Q2) = (120+130)/2 = 125, since there is an even number of values

To find the First Quartile we take the first six values and find their median.

Q1 = (110+110)/2 = 110

For the Third Quartile, we take the next six and find their median.

Q3 = (140+150)/2 = 145

Note: If the total number of values is odd, then we exclude the median while calculating Q1 and Q3. Here,
since there are two central values, both are included in their halves. Now, we need to calculate the Inter Quartile Range.

IQR = Q3-Q1 = 145-110 = 35

We can now calculate the Upper and Lower Limits to find the minimum and maximum values and also the
outliers if any.

Lower Limit = Q1-1.5*IQR = 110-1.5*35 = 57.5

Upper Limit = Q3+1.5*IQR = 145+1.5*35 = 197.5

So, the minimum and maximum within the range [57.5, 197.5] for our given data are –

Minimum = 100, Maximum = 170

The value outside this range is an outlier – Outliers = 220
Now we have all the information, so we can draw the box plot which is as below-

We can see from the diagram that the Median is not exactly at the center of the box and one whisker is
longer than the other. We also have one Outlier.
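The five-number summary above can be reproduced with a short Python sketch using the "median of each half" method described in the text, and the plot drawn with matplotlib. Note that numpy/matplotlib compute quartiles by interpolation, so their whisker positions may differ slightly from this hand calculation.

```python
import statistics
import matplotlib.pyplot as plt

runs = [100, 120, 110, 150, 110, 140, 130, 170, 120, 220, 140, 110]
data = sorted(runs)

# Quartiles using the "median of each half" method described above
half = len(data) // 2                     # for an odd count this drops the middle value
q1 = statistics.median(data[:half])       # 110
q2 = statistics.median(data)              # 125
q3 = statistics.median(data[-half:])      # 145
iqr = q3 - q1                             # 35

lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr          # 57.5 and 197.5
outliers = [x for x in data if x < lower or x > upper]
inside = [x for x in data if lower <= x <= upper]
print(q1, q2, q3, iqr, min(inside), max(inside), outliers)   # 110.0 125.0 145.0 35.0 100 170 [220]

plt.boxplot(data, vert=False)
plt.title("Runs scored in 12 matches")
plt.show()
```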

Pivot table

A pivot table is a statistics tool that summarizes and reorganizes selected columns and rows of data in
a spreadsheet or database table to obtain a desired report. The tool does not actually change the spreadsheet
or database itself, it simply “pivots” or turns the data to view it from different perspectives.

Pivot tables are especially useful with large amounts of data that would be time-consuming to calculate by
hand. A few data processing functions a pivot table can perform include identifying sums, averages, ranges
or outliers. The table then arranges this information in a simple, meaningful layout that draws attention to
key values.

Pivot table is a generic term, but is sometimes confused with the Microsoft trademarked term, PivotTable.
This refers to a tool specific to Excel for creating pivot tables.

How pivot tables work

When users create a pivot table, there are four main components:

1. Columns- When a field is chosen for the column area, only the unique values of the field are listed
across the top.

2. Rows- When a field is chosen for the row area, it populates as the first column. Similar to the
columns, all row labels are the unique values and duplicates are removed.

3. Values- Each value is kept in a pivot table cell and displays the summarized information. The most
common aggregations are sum, average, minimum and maximum.

4. Filters- Filters apply a calculation or restriction to the entire table.

For example, a store owner might list monthly sales totals for a large number of merchandise items in an
Excel spreadsheet. If they wanted to know which items sold better in a particular financial quarter, they
could use a pivot table. The sales quarters would be listed across the top as column labels and the products
would be listed in the first column as rows. The values in the worksheet would show the sum of sales for
each product in each quarter. A filter could then be applied to only show specific quarters, specific products
or averages.
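In pandas, the store-owner example might be sketched roughly as follows; the product names and sales figures are invented purely for illustration.

```python
import pandas as pd

# Hypothetical monthly sales records
sales = pd.DataFrame({
    "product": ["Pen", "Pen", "Notebook", "Notebook", "Pen", "Notebook"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q1"],
    "amount":  [120, 150, 300, 280, 90, 310],
})

# Rows = products, columns = quarters, values = summed sales
report = pd.pivot_table(sales, index="product", columns="quarter",
                        values="amount", aggfunc="sum", fill_value=0)
print(report)
```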

Uses of a pivot table

A pivot table helps users answer business questions with minimal effort. Common pivot table uses include:

• To calculate sums or averages in business situations. For example, counting sales by department or region.

• To show totals as a percentage of a whole. For example, comparing sales for a specific product to total sales.

• To generate a list of unique values. For example, showing which states or countries have ordered a product.

• To create a 2x2 table summary of a complex report.

• To identify the maximum and minimum values of a dataset.

• To query information directly from an online analytical processing (OLAP) server.

Heat Map
A heatmap (aka heat map) depicts values for a main variable of interest across two axis variables as a grid of
colored squares. The axis variables are divided into ranges like a bar chart or histogram, and each cell’s color
indicates the value of the main variable in the corresponding cell range.
The example heatmap above depicts the daily precipitation distribution, grouped by month, and recorded
over eleven years in Seattle, Washington. Each cell reports a numeric count, like in a standard data table, but
the count is accompanied by a color, with larger counts associated with darker colorings. From the heat map,
we can see from the darkest colorings in the left-most column that most days had no precipitation across the
entire year. The pattern in cell colors across months also shows that rain is more common in the winter from
November to March, and least common in the summer months of July and August.

2-d density plots

The term heatmap is also used in a more general sense, where data is not constrained to a grid. For example,
tracking tools for websites can be set up to see how users interact with the site, like studying where a user
clicks, or how far down a page readers tend to scroll.

Example heatmap from Google Maps documentation

Every click (or other tracking event) is associated with a position, which radiates a small amount of numeric
value around its location. These values are totaled together across all events and then plotted with an
associated colormap. The visual language of these tools’ output, associating value with color, is similar to
the type of heatmap defined at the top, just without a grid-based structure. Heatmaps of this type are
sometimes also known as 2-d density plots.

When you should use a heatmap

Heatmaps are used to show relationships between two variables, one plotted on each axis. By observing how
cell colors change across each axis, you can observe if there are any patterns in value for one or both
variables.
The variables plotted on each axis can be of any type, whether they take on categorical labels or numeric
values. In the latter case, the numeric value must be binned like in a histogram in order to form the grid cells
where colors associated with the main variable of interest will be plotted.

Cell colorings can correspond to all manner of metrics, like a frequency count of points in each bin, or
summary statistics like mean or median for a third variable. One way of thinking of the construction of a
heatmap is as a table or matrix, with color encoding on top of the cells. In certain applications, it is also
possible for cells to be colored based on non-numeric values (e.g. general qualitative levels of low, medium,
high).

Example of data structure

MONTH       < 0.1    0.1 - 4.0    4.1 - 10.0    …

January     255      167          123           …

February    244      196          89            …

March       268      198          119           …

April       321      179          88            …

…           …        …            …             …

Different visualization applications can have different ways of accepting data for plotting as a heatmap. In
one major form, data can be supplied in the same way it would be naturally displayed as a table. The first
column will hold values for one axis of the heatmap, while the names of the remaining columns will
correspond with bins for the remaining axis. Values in those columns will be encoded into the heatmap
itself.

The other common form for heatmap data sets it up in a three-column format. Each cell in the heatmap is
associated with one row in the data table. The first two columns specify the ‘coordinates’ of the heat map
cell, while the third column indicates the cell’s value.
MONTH PRCP_BUCKET COUNT

March 10.1 - 20.0 46

March > 20.0 20

April < 0.1 321

April 0.1 - 4.0 179

… … …
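Starting from this three-column (long) format, a heatmap can be produced by pivoting the data into a grid and colouring the cells. The matplotlib sketch below uses a few made-up counts; missing month/bucket combinations simply appear as blank cells.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Long-format data: one row per heatmap cell (the counts are illustrative)
long_data = pd.DataFrame({
    "month":  ["March", "March", "April", "April"],
    "bucket": ["10.1 - 20.0", "> 20.0", "< 0.1", "0.1 - 4.0"],
    "count":  [46, 20, 321, 179],
})

# Pivot to grid form: rows = months, columns = precipitation buckets
grid = long_data.pivot(index="month", columns="bucket", values="count")

fig, ax = plt.subplots()
im = ax.imshow(grid, cmap="Blues")          # darker cells correspond to larger counts
ax.set_xticks(range(len(grid.columns)))
ax.set_xticklabels(grid.columns, rotation=45, ha="right")
ax.set_yticks(range(len(grid.index)))
ax.set_yticklabels(grid.index)
fig.colorbar(im, ax=ax, label="days")       # legend mapping colour to value
plt.show()
```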

Best practices for using a heatmap

Choose an appropriate color palette

Color is a core component of this chart type, so it’s worth making sure that you choose an appropriate color
palette to match the data. Most frequently, there will be a sequential color ramp between value and color,
where lighter colors correspond to smaller values and darker colors to larger values, or vice versa. However,
a diverging color palette may be used when values have a meaningful zero point.

Include a legend

As an associated note, it is generally required for a heatmap to include a legend for how colors map to
numeric values. Since color on its own has no inherent association with value, a key is vital for viewers to
grasp the values in a heatmap. An exception for including a legend can come when the absolute association
of value to color is not important, only the relative patterns of data plotted.

Show values in cells

There is a lack of precision for mapping color to value, especially compared to other encodings like position
or length. Where possible, it is a good idea to add cell value annotations to the heatmap as a double encoding
of value.

Sort levels by similarity or value

When one or both axis variables in a plot are categorical in nature, it can be worth considering changing the
order in which those axis variable levels are plotted. If the categories do not have an inherent ordering, we
might want to choose an order that best helps the reader grasp patterns in the data. A common option is to
sort categories by their average cell value from largest to smallest.

The right-side heatmap is sorted by the last column value.

A more advanced technique involves grouping and clustering category values by measurement of similarity.
This is often seen in the clustered heatmap use case discussed below.
Select useful tick marks

For numeric axis variables, choices can be made in how bins are set up and how they are indicated in the
chart. If there are few bins, it is fine to keep tick marks on each bin like for a categorical axis variable.
However, when there are a lot of bins, a better option is to plot tick marks between sets of bins to avoid
overcrowding. The number of bins that you should use and how large they are will depend on the nature of
the data, so it can be a good idea to experiment with different settings. See our article on histograms for
more detailed tips on setting bin sizes for numeric variables.

Common heatmap options

Clustered heatmap

Instead of having the horizontal axis represent levels or values of a single variable, it is a common variation
to have it represent measurements of different variables or metrics. If we set the vertical axis as individual
observations, we end up with something resembling a standard data table, where each row is an observation
and the columns the entity’s value on each measured variable.

This type of heatmap is sometimes known as a clustered or clustering heatmap, since the goal of this kind of
chart is to build associations between both the data points and their features. We want to see which
individuals are similar or different from each other, with a similar objective for variables. Analysis tools that
construct this type of heatmap will usually implement clustering as part of their process. This use case is
found in areas like the biological sciences, such as when studying similarities in gene expression across
individuals.
In the above clustered heatmap, each column represents an individual flower specimen, and each row a
measurement from that specimen.

Correlogram

A correlogram is a variant of the heatmap that replaces each of the variables on the two axes with a list of
numeric variables in the dataset. Each cell depicts the relationship between the intersecting variables, such as
a linear correlation. Sometimes, these simple correlations are replaced with more complex representations of
relationship, like scatter plots.

Correlograms are often seen in an exploratory role, helping analysts understand relationships between
variables in service of building descriptive or predictive statistical models.

Petal length is highly correlated with petal width and sepal length; sepal width is negatively correlated with
the other three variables.

Related plots

Bar chart and histogram

The closest one-dimensional analogues for the heatmap are the bar chart and histogram, corresponding to
categorical and numeric data, respectively. For these charts, bar lengths are indicators of value, instead of
color. (Although it’s worth noting that histogram bars tend to solely depict frequency information – when a
summary metric is computed on each bin, we tend to use a line chart instead.) The best practices notes for
ordering levels and setting tick marks above come from these more basic chart types.

Grouped bar chart

An alternative way of showing data in a heatmap is through a grouped bar chart. Each row of the heatmap
becomes a cluster of bars, and each bar’s height indicates the corresponding cell’s value. Color is instead
used to make sure that column values can be tracked between clusters.

Grouped bar charts are used when more precise comparisons between cell values are desired. However, they
are a poor choice when there are a lot of bars that need to be plotted and when both axis variables are
numeric in nature. In that case, it’s best to stick with the heatmap, which is more compact and does a better
job of showing a broad overview across both axis variables at the same time.

Scatter plot

Scatter plots may not seem related to heatmaps, since they plot individual data points by position rather than
color. However, when there are so many data points that they have a high level of overlap, this can obscure
the relationship between variables, an issue called overplotting. One of the options for overcoming
overplotting is to use a heatmap instead, which counts the number of points that fall in each bin. This use of
a heatmap is also known as a 2-d histogram.

Choropleth

The language of associating color to value is not solely the domain of the heatmap. One particular example
of this kind of encoding can be seen in the choropleth. A choropleth is like a heatmap in that numeric values
are encoded with colored areas, but these values are associated with geographic regions rather than a strict
grid.

ANOVA

ANOVA, which stands for Analysis of Variance, is a statistical test used to analyze the difference between
the means of more than two groups. A one-way ANOVA uses one independent variable, while a two-way
ANOVA uses two independent variables.

When to use a one-way ANOVA

Use a one-way ANOVA when you have collected data about one categorical independent variable and
one quantitative dependent variable. The independent variable should have at least three levels (i.e. at least
three different groups or categories).

ANOVA tells you if the dependent variable changes according to the level of the independent variable. For
example:

• Your independent variable is social media use, and you assign groups to low, medium, and high levels of social media use to find out if there is a difference in hours of sleep per night.

• Your independent variable is brand of soda, and you collect data on Coke, Pepsi, Sprite, and Fanta to find out if there is a difference in the price per 100ml.

• Your independent variable is type of fertilizer, and you treat crop fields with mixtures 1, 2 and 3 to find out if there is a difference in crop yield.

The null hypothesis (H0) of ANOVA is that there is no difference among group means. The alternative
hypothesis (Ha) is that at least one group differs significantly from the overall mean of the dependent
variable.

How does an ANOVA test work?

ANOVA determines whether the groups created by the levels of the independent variable are statistically
different by calculating whether the means of the treatment levels are different from the overall mean of the
dependent variable.

If any of the group means is significantly different from the overall mean, then the null hypothesis is
rejected.

ANOVA uses the F test for statistical significance. This allows for comparison of multiple means at once,
because the error is calculated for the whole set of comparisons rather than for each individual two-way
comparison (which would happen with a t test).

The F test compares the variance between the group means with the variance within the groups. If the variance within
groups is smaller than the variance between groups, the F test will produce a higher F value, and therefore a
higher likelihood that the difference observed is real and not due to chance.
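As a quick sketch, a one-way ANOVA for the fertilizer example above could be run with SciPy's f_oneway; the yield figures below are invented purely for illustration.

```python
from scipy import stats

# Hypothetical crop yields (tonnes/ha) for three fertilizer mixtures
mixture_1 = [6.2, 5.9, 6.5, 6.1, 6.3]
mixture_2 = [6.8, 7.1, 6.9, 7.3, 7.0]
mixture_3 = [6.0, 6.2, 5.8, 6.1, 6.4]

# F statistic and p-value for H0: all group means are equal
f_stat, p_value = stats.f_oneway(mixture_1, mixture_2, mixture_3)
print(f_stat, p_value)   # a small p-value (e.g. < 0.05) would lead us to reject H0
```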

Assumptions of ANOVA

The assumptions of the ANOVA test are the same as the general assumptions for any parametric test:

1. Independence of observations: the data were collected using statistically valid sampling methods,
and there are no hidden relationships among observations. If your data fail to meet this assumption
because you have a confounding variable that you need to control for statistically, use an ANOVA
with blocking variables.

2. Normally-distributed response variable: The values of the dependent variable follow a normal
distribution.
3. Homogeneity of variance: The variation within each group being compared is similar for every
group. If the variances are different among the groups, then ANOVA probably isn’t the right fit for
the data.

TWO-WAY ANOVA

In the earlier example it could not be shown that there really is a significant difference in the average cholesterol
content of the four diet foods. The results were not statistically different because there was considerable
variation in the values within each of the samples, resulting in a large experimental error.

However, suppose we have the additional information that each value was measured in one of three different
laboratories, in such a way that the first value of each sample came from laboratory 1, the second value from
laboratory 2, and the third value from laboratory 3 (with test units randomly assigned to labs). In such a case, a
two-way analysis of variance is suggested.

We had earlier partitioned the total sum of squares into two components: one due to the differences between the
samples (treatment sum of squares) and the other due to the differences within the samples (error sum of squares).
This error sum of squares includes the sum of squares due to laboratories (called blocks) as an extraneous factor.
In two-way analysis of variance, we remove the effect of the extraneous factor (laboratories, or blocks) from the
error sum of squares. The total sum of squares is therefore partitioned into three components: one due to treatment,
a second due to blocks, and a third due to chance (the error sum of squares). It may be noted that the total sum of
squares (TSS) and the treatment sum of squares (TrSS) remain the same as computed earlier in Example 13.1. In
addition, we have another component, the block sum of squares (SSB), which is due to the different laboratories
and is computed as shown below.
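For the randomized block design assumed here, with k treatments (the diet foods) and b blocks (the laboratories), one observation per treatment–block combination, the block sum of squares is usually computed as

SSB = k · Σ (x̄.j − x̄)², summed over the b blocks j = 1, …, b,

where x̄.j is the mean of the observations in block j and x̄ is the grand mean. The error sum of squares is then obtained by subtraction: SSE = TSS − TrSS − SSB.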
Correlation Statistics:

This section shows how to calculate and interpret correlation coefficients for ordinal and interval level
scales. Methods of correlation summarize the relationship between two variables in a single number called
the correlation coefficient. The correlation coefficient is usually represented using the symbol r, and it ranges
from -1 to +1.

A correlation coefficient quite close to 0, but either positive or negative, implies little or no relationship
between the two variables. A correlation coefficient close to plus 1 means a positive relationship between the
two variables, with increases in one of the variables being associated with increases in the other variable.

A correlation coefficient close to -1 indicates a negative relationship between two variables, with an increase
in one of the variables being associated with a decrease in the other variable. A correlation coefficient can be
produced for ordinal, interval or ratio level variables, but has little meaning for variables which are measured
on a scale which is no more than nominal.

For ordinal scales, the correlation coefficient can be calculated by using Spearman’s rho. For interval or ratio
level scales, the most commonly used correlation coefficient is Pearson’s r, ordinarily referred to as simply
the correlation coefficient.

What Does Correlation Measure?

In statistics, correlation studies and measures the direction and extent of the relationship among variables; the
correlation measures co-variation, not causation. Therefore, we should never interpret correlation as
implying a cause-and-effect relation. For example, if a correlation exists between two variables X and Y, then
when the value of one variable changes in one direction, the value of the other variable tends to change either
in the same direction (i.e. a positive change) or in the opposite direction (i.e. a negative change). Furthermore,
if the correlation exists, it is linear, i.e. we can represent the relative movement of the
two variables by drawing a straight line on graph paper.
Correlation Coefficient

The correlation coefficient, r, is a summary measure that describes the extent of the statistical relationship
between two interval or ratio level variables. The correlation coefficient is scaled so that it is always between
-1 and +1. When r is close to 0 this means that there is little relationship between the variables and the
farther away from 0 r is, in either the positive or negative direction, the greater the relationship between the
two variables.

The two variables are often given the symbols X and Y. In order to illustrate how the two variables are
related, the values of X and Y are pictured by drawing a scatter diagram, graphing combinations of the two
variables. The scatter diagram is given first, and then the method of determining Pearson’s r is presented.
In the following examples, relatively small sample sizes are used; later, data from larger samples are
given.
Scatter Diagram

A scatter diagram is a diagram that shows the values of two variables X and Y, along with the way in which
these two variables relate to each other. The values of variable X are given along the horizontal axis, with
the values of the variable Y given on the vertical axis.

Later, when the regression model is used, one of the variables is defined as an independent variable, and the
other is defined as a dependent variable. In regression, the independent variable X is considered to have
some effect or influence on the dependent variable Y. Correlation methods are symmetric with respect to the
two variables, with no indication of causation or direction of influence being part of the statistical
consideration. A scatter diagram is given in the following example. The same example is later used to
determine the correlation coefficient.

Types of Correlation

The scatter plot explains the correlation between the two attributes or variables. It represents how closely the
two variables are connected. There can be three such situations describing the relation between the two variables:

• Positive Correlation – when the values of the two variables move in the same direction, so that an increase/decrease in the value of one variable is followed by an increase/decrease in the value of the other variable.

• Negative Correlation – when the values of the two variables move in the opposite direction, so that an increase/decrease in the value of one variable is followed by a decrease/increase in the value of the other variable.

• No Correlation – when there is no linear dependence or no relation between the two variables.
Correlation Formula

Correlation shows the relation between two variables. Correlation coefficient shows the measure of
correlation. To compare two datasets, we use the correlation formulas.

Pearson Correlation Coefficient Formula

The most common formula is the Pearson correlation coefficient, used to measure the linear dependency between
two data sets. The value of the coefficient lies between -1 and +1. When the coefficient is close to zero, the data
are considered unrelated; a value of +1 indicates that the data are perfectly positively correlated, and -1
indicates a perfect negative correlation.
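In its usual computational form, which matches the quantities listed below, Pearson's correlation coefficient is

r = [ n Σxy − (Σx)(Σy) ] / √( [ n Σx² − (Σx)² ] · [ n Σy² − (Σy)² ] )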

Where n = number of paired data values

Σx = sum of the values of the first variable

Σy = sum of the values of the second variable

Σxy = sum of the products of the paired first and second values

Σx² = sum of the squares of the values of the first variable

Σy² = sum of the squares of the values of the second variable

Linear Correlation Coefficient Formula

The linear correlation coefficient is given by the same Pearson formula shown above, applied to the paired values of the two variables.

Sample Correlation Coefficient Formula

The formula is given by:

rxy = Sxy / (Sx · Sy)

Where Sx and Sy are the sample standard deviations, and Sxy is the sample covariance.

Population Correlation Coefficient Formula

The population correlation coefficient uses σx and σy as the population standard deviations and σxy as the
population covariance:

rxy = σxy / (σx · σy)
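As a small numeric check, the sample correlation computed from the covariance and standard deviations matches SciPy's Pearson r; the data values below are arbitrary illustration data.

```python
import numpy as np
from scipy import stats

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

# r_xy = S_xy / (S_x * S_y): sample covariance divided by the product of sample standard deviations
s_xy = np.cov(x, y, ddof=1)[0, 1]
r_manual = s_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

r_scipy, p_value = stats.pearsonr(x, y)
print(round(r_manual, 4), round(r_scipy, 4))   # the two values agree
```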
