DS Notes Unit - III
MODEL DEVELOPMENT
Descriptive statistics:
Descriptive statistics describe, show, and summarize the basic features of a dataset found in a
given study, presented in a summary that describes the data sample and its measurements. It
helps analysts to understand the data better.
Descriptive statistics represent the available data sample and do not include theories, inferences,
or conclusions.
If you want a good example of descriptive statistics, look no further than a student's grade point
average (GPA). A GPA gathers the data points created through a large selection of grades,
classes, and exams, then averages them together and presents a general idea of the student's mean
academic performance. Note that the GPA does not predict future performance or present any
conclusions.
Descriptive statistics break down into several types, characteristics, or measures. Some authors
say that there are two types. Others say three or even four. In the spirit of working with averages,
we will go with three types.
Datasets consist of a distribution of scores or values. Statisticians use graphs and tables to
summarize the frequency of every possible value of a variable, rendered in percentages or
numbers. For example, a poll of people's favorite Beatle would be summarized in a table with one
column with all possible values (John, Paul, George, and Ringo), and another with the number
of votes.
Statisticians depict frequency distributions as either a graph or as a table.
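A frequency table like the one described for the Beatles poll can be sketched in a few lines of Python. The vote list below is invented purely for illustration, and the helper name frequency_table is hypothetical; the point is only to show counts and percentages side by side.

```python
from collections import Counter

# Hypothetical poll responses; the votes are made up for illustration.
votes = ["John", "Paul", "Paul", "George", "Ringo", "Paul", "John", "George"]


def frequency_table(responses):
    """Return {value: (count, percentage)} for each distinct response."""
    counts = Counter(responses)
    total = len(responses)
    return {value: (n, 100 * n / total) for value, n in counts.items()}


table = frequency_table(votes)
for name, (count, percent) in sorted(table.items()):
    # One row per possible value: name, number of votes, share of all votes.
    print(f"{name:<8}{count:>3}{percent:>8.1f}%")
```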
Measures of central tendency estimate a dataset's average or center, finding the result using three
methods: mean, mode, and median.
Mean
You get the mean by adding all the response values together, then dividing the sum by the number of
responses, N. Suppose someone is trying to figure out how many hours a day they
sleep in a week. The data set would be the hour entries (e.g., 6,8,7,10,8,4,9), and the sum of
those values is 52. There are seven responses, so N=7. You divide the value sum of 52 by N, or
7, to find M, which in this instance is about 7.4 (52/7 = 7.43; the calculations later in these
notes use the rounded value 7.3).
Mode. The mode is just the most frequent response value. Datasets may have any number of
modes, including zero. You can find the mode by arranging your dataset's order from the lowest
to highest value and then looking for the most common response. So, in using our sleep study
from the last part: 4,6,7,8,8,9,10. As you can see, the mode is eight.
Median. Finally, we have the median, defined as the value in the precise center of the dataset.
Arrange the values in ascending order (like we did for the mode) and look for the number in the
middle. In the sleep study, 4,6,7,8,8,9,10, the median is eight.
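The three measures of central tendency can be checked with Python's standard statistics module. This is a minimal sketch on the sleep data from the example; multimode is used so that every mode is reported if there is more than one.

```python
import statistics

# Hours of sleep per night from the sleep study example.
sleep_hours = [6, 8, 7, 10, 8, 4, 9]

mean = statistics.mean(sleep_hours)        # sum of values divided by N
median = statistics.median(sleep_hours)    # middle value of the sorted data
modes = statistics.multimode(sleep_hours)  # list of all most frequent values

print(f"mean={mean:.2f} median={median} modes={modes}")
```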
The measure of variability gives the statistician an idea of how spread out the responses are.
The spread has three aspects: range, standard deviation, and variance.
Range. Use range to determine how far apart the most extreme values are. Start by subtracting
the dataset's lowest value from its highest value. In the sleep study, 10 - 4 = 6, so 6 is the
range.
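Standard Deviation. This aspect takes a little more work. The standard deviation (s) is your
dataset's average amount of variability: it shows how far each score lies from the mean. The
table lists each score, its deviation from the mean (rounded to 7.3), and the squared deviation.

Score   Deviation from mean    Squared deviation
6       6-7.3 = -1.3            1.69
8       8-7.3 = 0.7             0.49
7       7-7.3 = -0.3            0.09
10      10-7.3 = 2.7            7.29
8       8-7.3 = 0.7             0.49
4       4-7.3 = -3.3           10.89
9       9-7.3 = 1.7             2.89
        Sum                    23.83

When you divide the sum of the squared deviations by 6 (N-1): 23.83/6, you get about 3.97, and the
square root of that result is about 1.99. As a result, we now know that each score deviates from the
mean by an average of about 1.99 points. (With the unrounded mean of 7.43, the sum of squared
deviations is 23.71 and s is 1.988.)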
Variance: Variance reflects the degree of spread in the data set. The greater the degree of data spread,
the larger the variance relative to the mean. You can get the variance by just squaring the
standard deviation. Using our above example, we square 1.992 and arrive at about 3.97.
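The same three measures of spread can be reproduced with the standard library. This sketch uses the sample formulas (dividing by N-1), matching the worked example; because it uses the unrounded mean, it gives s = 1.988 rather than the hand-calculated 1.992.

```python
import statistics

sleep_hours = [6, 8, 7, 10, 8, 4, 9]

value_range = max(sleep_hours) - min(sleep_hours)  # highest minus lowest value
variance = statistics.variance(sleep_hours)        # sample variance, divides by N - 1
std_dev = statistics.stdev(sleep_hours)            # square root of the sample variance

print(f"range={value_range} variance={variance:.3f} std_dev={std_dev:.3f}")
```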
Skewness:
Skewness is the measure of the asymmetry of an ideally symmetric probability distribution and
is given by the third standardized moment. If that sounds way too complex, don't worry! Let me
break it down for you.
In simple words, skewness is the measure of how much the probability distribution of a random
variable deviates from a symmetrical distribution such as the normal distribution.
The normal distribution is the probability distribution without any skewness. A symmetrical
distribution, such as the normal distribution, looks the same on both sides of its center
line. Apart from this, there are two
types of skewness:
Positive Skewness
Negative Skewness
The probability distribution with its tail on the right side is a positively skewed distribution and
the one with its tail on the left side is a negatively skewed distribution.
In a Normal Distribution: mean = median = mode.
Standard Normal Distribution: When the Normal Distribution has mean = 0 and Standard
Deviation = 1, it is called the Standard Normal Distribution.
Normal Distributions are symmetrical in nature, but this does not imply that every symmetrical
distribution is a Normal Distribution.
Normal Distribution is the probability distribution without any skewness.
Types of Skewness
Positive Skewness
Negative Skewness
Unlike the Normal Distribution (mean = median = mode), in both positive and negative
skewness the mean, median, and mode are all different.
Positive Skewness
In positive skewness, the extreme data values are larger, which in turn increases the mean value of
the data set. In simple terms, a positively skewed distribution is a distribution with its tail
on the right side.
In Positive Skewness, typically: mean > median > mode
Negative Skewness
In negative skewness, the extreme data values are smaller, which decreases the mean value of the
dataset. In simple terms, a negatively skewed distribution is a distribution with its tail on the left side.
In Negative Skewness, typically: mean < median < mode
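The third standardized moment mentioned earlier can be computed directly. The sketch below uses the population form (mean of the cubed z-scores), which is an assumption since the notes do not say which variant is meant, and the two small data sets are invented to show a right tail and its mirror image.

```python
import statistics


def skewness(values):
    """Third standardized moment: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation (assumption)
    return sum(((x - mean) / sd) ** 3 for x in values) / len(values)


right_tailed = [1, 2, 2, 3, 3, 4, 15]     # illustrative data with a long right tail
left_tailed = [-x for x in right_tailed]  # mirror image with a long left tail

for name, data in (("right", right_tailed), ("left", left_tailed)):
    print(name, round(skewness(data), 3),
          "mean:", round(statistics.fmean(data), 2),
          "median:", statistics.median(data))
```

The right-tailed data gives a positive value with the mean above the median, and the mirrored data gives the same value with the sign flipped and the mean below the median.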
Calculate the skewness coefficient of the sample
Subtract the mode from the mean, then divide the difference by the standard deviation.
If the skewness is between -0.5 & 0.5, the data are nearly
symmetrical.
If the skewness is between -1 & -0.5 (negative skewed) or between 0.5 & 1 (positive skewed), the
data are moderately skewed.
If the skewness is lower than -1 (negative skewed) or greater than 1 (positive skewed), the data
are highly skewed.
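The coefficient described here, (mean - mode) / standard deviation, and the interpretation bands can be sketched as follows. The function names are hypothetical, the labels "moderately skewed" and "highly skewed" are the commonly used wording and are an assumption here, and the sample standard deviation is used.

```python
import statistics


def pearson_mode_skewness(values):
    """Skewness coefficient: (mean - mode) / standard deviation."""
    mean = statistics.fmean(values)
    mode = statistics.mode(values)  # the first mode if several values tie
    return (mean - mode) / statistics.stdev(values)


def describe_skewness(skew):
    """Map a skewness value to the rule-of-thumb bands (labels are assumed)."""
    if -0.5 <= skew <= 0.5:
        return "nearly symmetrical"
    if -1 <= skew <= 1:
        return "moderately skewed"
    return "highly skewed"


sleep_hours = [6, 8, 7, 10, 8, 4, 9]
coefficient = pearson_mode_skewness(sleep_hours)
print(round(coefficient, 3), describe_skewness(coefficient))
```

For the sleep data the mean (7.43) sits slightly below the mode (8), so the coefficient is a small negative number and the data count as nearly symmetrical.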
Kurtosis:
Kurtosis measures the "heaviness of the tails" of a distribution (compared to a normal
distribution). Kurtosis is positive if the tails are "heavier" than for a normal distribution, and
negative if the tails are "lighter" than for a normal distribution. With the definition used here,
which subtracts 3, the normal distribution has
kurtosis of zero.
Kurtosis characterizes the shape of a distribution - that is, its value does not depend on an
arbitrary change of the scale and location of the distribution. For example, kurtosis of a sample
(or population) of temperature values in Fahrenheit will not change if you transform the values
to Celsius (the mean and the variance will, however, change).
The kurtosis of a distribution or sample is equal to the 4th central moment divided by the 4th
power of the standard deviation, minus 3.
To calculate the kurtosis of a sample:
i) subtract the mean from each value to get a set of deviations from the mean;
ii) divide each deviation by the standard deviation of all the deviations;
iii) average the 4th power of the standardized deviations and subtract 3 from the result.
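The three steps translate almost line for line into Python. This sketch assumes the population standard deviation in step ii, and the temperature readings are invented; it also checks the claim that converting Fahrenheit to Celsius leaves kurtosis unchanged.

```python
import statistics


def excess_kurtosis(values):
    """Kurtosis minus 3, following steps i to iii."""
    mean = statistics.fmean(values)
    deviations = [x - mean for x in values]      # step i
    sd = statistics.pstdev(deviations)           # population SD (assumption)
    standardized = [d / sd for d in deviations]  # step ii
    fourth_powers = [z ** 4 for z in standardized]
    return sum(fourth_powers) / len(fourth_powers) - 3  # step iii


# Hypothetical temperature readings, invented for illustration.
temps_f = [50.0, 59.0, 68.0, 77.0, 104.0]
temps_c = [(f - 32) * 5 / 9 for f in temps_f]

print(round(excess_kurtosis(temps_f), 4), round(excess_kurtosis(temps_c), 4))
```

Both printed values agree, even though the mean and variance of the two lists differ.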
Excess Kurtosis
The excess kurtosis is used in statistics and probability theory to compare the kurtosis coefficient
with that of the normal distribution, whose kurtosis is 3. Excess kurtosis (kurtosis minus 3) can be
positive (Leptokurtic distribution), negative
(Platykurtic distribution), or near zero (Mesokurtic distribution).
Leptokurtic (kurtosis > 3)
A leptokurtic distribution has long, heavy tails, which means there are more chances of outliers.
Positive values of excess kurtosis indicate that the distribution is peaked and possesses thick tails. An
extreme positive kurtosis indicates a distribution where more of the numbers are located in the
tails of the distribution instead of around the mean.
Platykurtic (kurtosis < 3)
A platykurtic distribution has lighter tails than the normal distribution, which means most of the
data points are close to the mean. A platykurtic distribution is flatter (less peaked) when
compared with the normal distribution.
Mesokurtic (kurtosis = 3)
A mesokurtic distribution has the same kurtosis as the normal distribution, which means its excess
kurtosis is near 0. Mesokurtic distributions are moderate in breadth, and their curves have a medium peak height.
Box plot:
A box plot displays the Five Number Summary: it summarizes data using the median, upper
quartile, lower quartile, and the minimum and maximum values. It allows you to see important
characteristics of the data at a glance (visually). It also helps us to visualize outliers in the data
set.
The five numbers are:
1. Median
2. Lower Quartile
3. Upper Quartile
4. Minimum Value
5. Maximum Value
Working Example of Box Plot
Consider the following data set:
14, 19, 100, 27, 54, 52, 93, 50, 61, 87, 68, 85, 75, 82, 95
Step 1: Arrange the data set in ascending order.
14, 19, 27, 50, 52, 54, 61, 68, 75, 82, 85, 87, 93, 95, 100
Step 2: Find the median of this data set. The median is the middle value of the ordered data set.
14, 19, 27, 50, 52, 54, 61, (68), 75, 82, 85, 87, 93, 95, 100
Here it is 68.
Step 3: Find the lower quartile. It is the median of the values to the left of the median found in Step 2 (i.e. 68).
(14, 19, 27, 50, 52, 54, 61), 68, 75, 82, 85, 87, 93, 95, 100
Lower Quartile is 50
Step 4: Find the upper quartile. It is the median of the values to the right of the median found in Step 2 (i.e. 68).
14, 19, 27, 50, 52, 54, 61, 68, (75, 82, 85, 87, 93, 95, 100)
Upper Quartile is 87
Step 5: Find the minimum value. It is the value at the extreme left of the ordered data set, i.e. the first value after ordering.
14, 19, 27, 50, 52, 54, 61, 68, 75, 82, 85, 87, 93, 95, 100
Minimum Value is 14
Step 6: Find the maximum value. It is the value at the extreme right of the ordered data set, i.e. the last value after
ordering.
14, 19, 27, 50, 52, 54, 61, 68, 75, 82, 85, 87, 93, 95, 100
Maximum Value is 100
Range :
Range is basically the spread of our data set. It is the difference between the Maximum
Value and the Minimum Value. Here the range is 100 - 14 = 86.
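Interquartile Range(IQR):
The interquartile range is the difference between the upper quartile and the lower quartile. In the
working example, IQR = 87 - 50 = 37.
When the data set has an even number of values, the median is found differently. Remove 95 from
the working example so that 14 values remain.
Step 1: Arrange the data set in ascending order.
14, 19, 100, 27, 54, 52, 93, 50, 61, 87, 68, 85, 75, 82
14, 19, 27, 50, 52, 54, 61, 68, 75, 82, 85, 87, 93, 100
Step 2: Since the number of values is even, take the middle two values (61 and 68), add them, and divide by 2.
(61 + 68) / 2 = 64.5
So our new median value is 64.5
Continue with Step 3 to Step 6 of the Working Example of Box Plot section to get the remaining
values.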
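The worked example can be checked with a short sketch. The function name five_number_summary is hypothetical, and the quartiles are computed the way the steps describe (the median of each half, leaving out the overall median when the count is odd); library routines such as percentile functions may use other conventions and return slightly different quartiles.

```python
import statistics


def five_number_summary(values):
    """Min, lower quartile, median, upper quartile and max of a data set."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]              # values to the left of the median
    upper = data[len(data) - half:]  # values to the right of the median
    return {
        "min": data[0],
        "q1": statistics.median(lower),
        "median": statistics.median(data),
        "q3": statistics.median(upper),
        "max": data[-1],
    }


scores = [14, 19, 100, 27, 54, 52, 93, 50, 61, 87, 68, 85, 75, 82, 95]
summary = five_number_summary(scores)
data_range = summary["max"] - summary["min"]  # spread of the whole data set
iqr = summary["q3"] - summary["q1"]           # spread of the middle half

# Same data without 95, so the number of values is even.
even_scores = [x for x in scores if x != 95]
even_median = five_number_summary(even_scores)["median"]

print(summary, data_range, iqr, even_median)
```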
Pivot table:
A pivot table is a summary tool that condenses information sourced from bigger
tables. These bigger tables could be a database, an Excel spreadsheet, or any data that is or could
be converted into a table-like form. The data summarized in a pivot table might include sums,
averages, or other statistics which the pivot table groups together in a meaningful way.
Wide vs. Long Data
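The pandas pivot_table function takes four main arguments:
1. index determines what will be unique in the leftmost column of the result.
2. columns creates a column for each unique value in the selected column
of the input table.
3. values defines what to put in the cells of the output table.
4. aggfunc defines how to combine the values (commonly sum, but other aggregation
functions are also used: min, max, mean, 95th percentile, mode). You can also define your
own aggregation functions and pass in the function.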
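A minimal sketch of how these arguments fit together in pandas is shown here. The sales table is hypothetical and invented for illustration; only the pivot_table call itself reflects the arguments listed.

```python
import pandas as pd

# Hypothetical long-format sales data, invented for illustration.
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2"],
    "revenue": [100, 150, 80, 20, 120],
})

summary = pd.pivot_table(
    sales,
    index="region",     # one row per unique region
    columns="quarter",  # one column per unique quarter
    values="revenue",   # the numbers that fill the cells
    aggfunc="sum",      # how rows landing in the same cell are combined
)
print(summary)
```

The two South/Q1 rows (80 and 20) land in the same cell, which is where aggfunc comes into play.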
Pivoting in Google Sheets does the exact same thing as the Python pivot_table function, but in a
graphical way. Instead of specifying arguments for the pivot_table function, we select the same
options from the pivot table editor.
Grouping in PostgreSQL
Grouping in PostgreSQL approximates the desired results, but it does require some manual
intervention to define the possible groups.
Aggregation Functions
Sum is often used to combine data in pivot tables, but pivot tables are much more flexible than
just simple sums. Different tools provide their own selection of aggregation functions. For
example, Pandas provides: min, max, first, last, unique, std (standard deviation),
var (variance), count, quantile among others. We can also define our own
aggregation functions and pass multiple different aggregation functions to the same column or
different columns in the same pivot table.
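Missing Data
As with any aggregation, missing data must be dealt with. In the example data used here there are
numerous months with a value of 0. In this case the absence of data means a count of zero for
those months, not that there were values that went
missing, so the nulls can be replaced with 0. In other cases nulls must be left as they are. We either
need to accept (and be aware of) the default behaviour for these cases or specify what to do.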
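In pandas, the default behaviour and one way of overriding it can be sketched as follows. The event log is hypothetical; the fill_value argument of pivot_table replaces empty cells, which is appropriate only when a missing combination really means zero.

```python
import pandas as pd

# Hypothetical event log: site "B" has no rows at all for February.
events = pd.DataFrame({
    "site": ["A", "A", "B"],
    "month": ["Jan", "Feb", "Jan"],
    "count": [3, 5, 2],
})

# Default behaviour: a combination with no rows becomes NaN.
default = pd.pivot_table(events, index="site", columns="month",
                         values="count", aggfunc="sum")

# Explicit choice: treat "no rows" as a count of zero.
filled = pd.pivot_table(events, index="site", columns="month",
                        values="count", aggfunc="sum", fill_value=0)

print(default)
print(filled)
```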
Heat map:
Heatmaps visualize the data in a 2-dimensional format in the form of colored maps. The color
maps use hue, saturation, or luminance to achieve color variation to display various details. This
color variation gives visual cues to the readers about the magnitude of numeric values.
Heatmaps can describe the density or intensity of variables, visualize patterns, variance, and even
anomalies. Heatmaps show relationships between variables. These variables are plotted on both
axes.
Uses of HeatMap
Business Analytics: A heat map is used as a visual business analytics tool. A heat map gives
quick visual cues about the current results, performance, and scope for improvements. Heatmaps
can analyze the existing data and find areas of intensity that might reflect where most customers
reside, areas of risk of market saturation, or cold sites and sites that need a boost. Heat maps can
be updated continually to reflect growth and efforts. These maps can be integrated into a
business's workflow.
Website: Heatmaps are used to visualize visitor behavior on a website. This
visualization helps business owners and marketers to identify the best & worst-performing
sections of a webpage. These insights help them with optimization.
Exploratory Data Analysis: EDA is a task performed by data scientists to get familiar
with the data. All the initial studies done to understand the data are known as EDA.
Exploratory Data Analysis (EDA) is the process of analyzing datasets before the
modeling task. It is a tedious task to look at a spreadsheet filled with numbers and
determine essential characteristics in a dataset.
Molecular Biology: Heat maps are used to study disparity and similarity patterns in DNA,
RNA, etc.
Geovisualization: Geospatial heatmap charts are useful for displaying how geographical
areas of a map are compared to one another based on specific criteria. Heatmaps help in
cluster analysis or hotspot analysis to detect clusters of high concentrations of activity;
For example, Airbnb rental price analysis.
Types of HeatMaps
Grid Heatmap: The magnitudes of values shown through colors are laid out into a matrix of
rows and columns, mostly by a density-based function. Below are the types of Grid
Heatmaps.
Clustered Heatmap: The goal of Clustered Heatmap is to build associations between both
the data points and their features. This type of heatmap implements clustering as part of
the process of grouping similar features. Clustered Heatmaps are widely used in
biological sciences for studying gene similarities across individuals.
The order of the rows in Clustered Heatmap is determined by performing hierarchical cluster
analysis of the rows. Clustering positions similar rows together on the map. Similarly, the order
of the columns is determined.
Correlogram: A correlogram replaces each of the variables on the two axes with numeric
variables in the dataset. Each square depicts the relationship between the two intersecting
variables, which helps to build descriptive or predictive statistical models.
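A correlogram can be sketched with pandas and matplotlib: compute the pairwise correlations, then colour each cell by its value. The workout data below is hypothetical and invented for illustration, and the colour map is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical workout data, invented for illustration.
workouts = pd.DataFrame({
    "Duration": [30, 45, 60, 75, 90, 120],
    "Average_Pulse": [110, 105, 115, 100, 108, 98],
    "Calorie_Burnage": [240, 280, 420, 400, 510, 600],
})

corr = workouts.corr()  # pairwise Pearson correlation of the numeric columns

fig, ax = plt.subplots()
# Fix the colour scale to [-1, 1] so colours are comparable between plots.
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(image, ax=ax, label="correlation coefficient")
fig.tight_layout()
# Call plt.show() to display the figure in an interactive session.
```

Each square shows the correlation between the row variable and the column variable, so the diagonal is always 1 and the matrix is symmetric.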
Correlation statistics:
Correlation
We mentioned that the purpose of a function is to predict a value by converting input (x) to output
(f(x)). We can also say that a function uses the relationship between two variables for
prediction.
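Correlation Coefficient
The correlation coefficient measures the strength of the linear relationship between two
variables. It ranges from -1 to 1:
1 = there is a perfect positive linear relationship between the variables
0 = there is no linear relationship between the variables
-1 = there is a perfect negative linear relationship between the variables (e.g. fewer hours
worked leads to higher calorie burnage during a training session)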
import pandas as pd
import matplotlib.pyplot as plt
plt.show()
output:
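import pandas as pd
import matplotlib.pyplot as plt

# Fictional data. The hours column is reconstructed from the description
# of the plot (an assumption); the calorie values are from the notes.
negative_corr = {'Hours_Work_Before_Training': [10,9,8,7,6,5,4,3,2,1],
'Calorie_Burnage': [220,240,260,280,300,320,340,360,380,400]}
negative_corr = pd.DataFrame(data=negative_corr)

negative_corr.plot(x='Hours_Work_Before_Training', y='Calorie_Burnage', kind='scatter')
plt.show()
output: scatter plot of Calorie_Burnage against Hours_Work_Before_Training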
We have plotted fictional data here. The x-axis represents the amount of hours worked at our job
before a training session. The y-axis is Calorie_Burnage.
If we work longer hours, we tend to have lower calorie burnage because we are exhausted before
the training session.
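The strength of this negative relationship can be put into a number with the correlation coefficient. In the sketch below the calorie values come from the notes, while the hours column is an assumption (a steadily decreasing series chosen to match the description), so the result is illustrative only.

```python
import pandas as pd

negative_corr = pd.DataFrame({
    # Assumed values: a steadily decreasing series that matches the
    # description of hours worked before training.
    "Hours_Work_Before_Training": [10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
    "Calorie_Burnage": [220, 240, 260, 280, 300, 320, 340, 360, 380, 400],
})

# Pearson correlation coefficient between the two columns.
r = negative_corr["Hours_Work_Before_Training"].corr(
    negative_corr["Calorie_Burnage"]
)
print(round(r, 2))
```

Because every extra hour of work lowers calorie burnage by the same amount in this made-up data, the points lie on a straight line and the coefficient comes out at -1, a perfect negative linear relationship.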
import pandas as pd
import matplotlib.pyplot as plt

# full_health_data is assumed to be a DataFrame that is already loaded
# and has 'Duration' and 'Max_Pulse' columns.
full_health_data.plot(x ='Duration', y='Max_Pulse', kind='scatter')
plt.show()
output: scatter plot of Max_Pulse against Duration
ANOVA:
ANOVA stands for analysis of variance and, as the name suggests, it helps us understand and
compare variances among groups. Before going in detail about ANOVA, remember a few
terms in statistics:
Mean: The average of all values.
Variance: A measure of the variation among values. It is calculated by adding up the squared
differences between each value and the mean and then dividing the sum by the number of
samples (or by one less than that number when estimating from a sample).
In order to understand the motivation behind ANOVA, or some other statistical tests, we should
learn two simple terms: population and sample.
College students in the US is a population that includes all of the college students in the US.
25-year-old people in Europe is a population that includes all of the people that fit the
description.
It is not always feasible or possible to do analysis on population because we cannot collect all
the data of a population. Therefore, we use samples.
A sample is a subset of a population: 1000 college students in the US is a sample of the
population of all US college students.
When we compare two samples (or groups), we can use a t-test to see if there is any difference in
the means of the groups. When we have more than two groups, a t-test is not the optimal choice because
we need to apply it to each pair separately. Suppose we have groups A, B and C. To be able to
compare the means, we need to apply a t-test to A-B, A-C and B-C. As the number of groups
increases, this becomes harder to manage.
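How quickly the number of pairwise t-tests grows can be seen by listing the pairs. This is a small sketch using the standard library; the function name pairwise_comparisons is hypothetical.

```python
from itertools import combinations


def pairwise_comparisons(groups):
    """Every pair of groups that would need its own t-test."""
    return list(combinations(groups, 2))


three_groups = pairwise_comparisons(["A", "B", "C"])
ten_groups = pairwise_comparisons(range(10))

print(three_groups)     # the A-B, A-C and B-C comparisons
print(len(ten_groups))  # number of separate t-tests for ten groups
```

With k groups there are k(k-1)/2 pairs, so three groups need 3 tests but ten groups already need 45.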
In the case of comparing three or more groups, ANOVA is preferred. There are two elements of
ANOVA:
Variation within each group
Variation between groups
ANOVA test result is based on F ratio which is the ratio of the variation between groups to the
variation within groups.
F ratio shows how much of the total variation comes from the variation between groups and how
much comes from the variation within groups.
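The F ratio for a one-way ANOVA can be sketched in plain Python. The function name f_ratio and the three groups are hypothetical; the sketch assumes the usual form in which each variation is a mean square, that is, a sum of squares divided by its degrees of freedom.

```python
def f_ratio(groups):
    """One-way ANOVA F ratio: between-group over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation between groups: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation within groups: how far each value is from its own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)       # k - 1 degrees of freedom
    ms_within = ss_within / (n_total - k)   # N - k degrees of freedom
    return ms_between / ms_within


# Hypothetical measurements for groups A, B and C, invented for illustration.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value = f_ratio(groups)
print(round(f_value, 2))
```

A large F value means the group means differ by much more than the scatter inside each group would explain; an F value near 0 means the group means are almost identical.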