DS Notes Unit - III

The document discusses descriptive statistics, which summarize and describe the basic features of a dataset, including types such as distribution, central tendency, and variability. It explains key concepts like skewness and kurtosis, highlighting their significance in understanding data distribution and characteristics. Additionally, it covers practical tools like box plots and pivot tables for data visualization and summarization.


UNIT 4

MODEL DEVELOPMENT

Descriptive statistics:
Descriptive statistics describe, show, and summarize the basic features of a dataset found in a
given study, presented in a summary that describes the data sample and its measurements. It
helps analysts to understand the data better.

Descriptive statistics represent the available data sample and do not include theories, inferences, or conclusions drawn beyond that sample.
If you want a good example of descriptive statistics, look no further than a student's grade point average (GPA). A GPA gathers the data points created through a large selection of grades, classes, and exams, then averages them together and presents a general idea of the student's mean academic performance. Note that the GPA does not predict future performance or present any conclusions.

Types of Descriptive Statistics

Descriptive statistics break down into several types, characteristics, or measures. Some authors
say that there are two types. Others say three or even four. In the spirit of working with averages,
we will go with three types.

Distribution, which deals with the frequency of each value

Central tendency, which covers the averages of the values
Variability (or dispersion), which shows how spread out the values are

Distribution (also called Frequency Distribution)

Datasets consist of a distribution of scores or values. Statisticians use graphs and tables to summarize the frequency of every possible value of a variable, rendered in percentages or counts. For example, a frequency table for a poll of favorite band members would have one column with all possible values (John, Paul, George, and Ringo), and another with the number of votes each received.
Statisticians depict frequency distributions as either a graph or as a table.
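As a quick illustration, the frequency of each value can be tallied with Python's standard library (the vote list below is invented for illustration):

```python
from collections import Counter

# Hypothetical poll responses (invented for illustration)
votes = ["John", "Paul", "John", "Ringo", "George", "John", "Paul"]

freq = Counter(votes)        # count of each distinct value
total = sum(freq.values())

# Frequency table rendered as counts and percentages
for name, count in freq.most_common():
    print(f"{name:<8} {count:>2}  {100 * count / total:5.1f}%")
```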

Measures of Central Tendency

Measures of central tendency estimate a dataset's average or center, finding the result using three
methods: mean, mode, and median.

Mean
You get the mean by adding all the response values together and dividing the sum by the number of responses (N). For example, someone in a sleep study is trying to figure out how many hours a day they sleep in a week. So, the data set would be the hour entries (e.g., 6, 8, 7, 10, 8, 4, 9), and the sum of those values is 52. There are seven responses, so N = 7. You divide the value sum of 52 by N, or 7, to find M, which in this instance is approximately 7.43.

Mode. The mode is just the most frequent response value. Datasets may have any number of modes, including none at all. You can find the mode by arranging your dataset in order from the lowest to the highest value and then looking for the most common response. So, using our sleep study from the last part: 4, 6, 7, 8, 8, 9, 10. As you can see, the mode is eight.

Median. Finally, we have the median, defined as the value in the precise center of the dataset. Arrange the values in ascending order (like we did for the mode) and look for the number in the middle: 4, 6, 7, 8, 8, 9, 10. The median here is 8.
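All three measures for the sleep-study data can be checked with Python's built-in statistics module:

```python
import statistics

sleep_hours = [6, 8, 7, 10, 8, 4, 9]   # the sleep-study responses

print(round(statistics.mean(sleep_hours), 2))  # -> 7.43 (mean, M)
print(statistics.median(sleep_hours))          # -> 8 (median)
print(statistics.mode(sleep_hours))            # -> 8 (mode)
```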

Variability (also called Dispersion)

The measure of variability gives the statistician an idea of how spread out the responses are.
The spread has three aspects: range, standard deviation, and variance.

Range. Use range to determine how far apart the most extreme values are. Start by subtracting the dataset's lowest value from its highest value; the result is the range. For the sleep study, the range is 10 - 4 = 6.

Standard Deviation. This aspect takes a little more work. The standard deviation (s) is your dataset's average amount of variability, showing how far each score lies from the mean. The larger the standard deviation, the more variable the dataset. Follow these six steps:

1. List the scores and their mean.

2. Find the deviation by subtracting the mean from each score.
3. Square each deviation.
4. Total up all the squared deviations.
5. Divide the sum of the squared deviations by N - 1.
6. Find the square root.
Example: we turn to our sleep study: 4, 6, 7, 8, 8, 9, 10, with mean M = 52/7, which is approximately 7.43.

Raw Number/Data   Deviation from Mean   Deviation Squared

4    4 - 7.43 = -3.43    11.76

6    6 - 7.43 = -1.43    2.04

7    7 - 7.43 = -0.43    0.18

8    8 - 7.43 = 0.57     0.33

8    8 - 7.43 = 0.57     0.33

9    9 - 7.43 = 1.57     2.47

10   10 - 7.43 = 2.57    6.61

M ≈ 7.43   Sum ≈ 0   Sum of squares ≈ 23.71

When you divide the sum of the squared deviations by 6 (N - 1): 23.71/6, you get 3.95, and the square root of that result is 1.99. As a result, we now know that each score deviates from the mean by an average of about 1.99 points.

Variance: Variance reflects the degree of spread in the dataset. The greater the degree of data spread, the larger the variance relative to the mean. You can get the variance by simply squaring the standard deviation. Using our example above, we square 1.99 and arrive at approximately 3.95.
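The six steps above match what the standard statistics module computes (its sample functions divide by N - 1):

```python
import statistics

sleep_hours = [6, 8, 7, 10, 8, 4, 9]

s = statistics.stdev(sleep_hours)       # sample standard deviation
v = statistics.variance(sleep_hours)    # sample variance (s squared)

print(round(s, 2))   # -> 1.99
print(round(v, 2))   # -> 3.95
```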

Skewness:
Skewness is the measure of the asymmetry of an ideally symmetric probability distribution and is given by the third standardized moment. If that sounds way too complex, don't worry! Let me break it down for you.

In simple words, skewness is the measure of how much the probability distribution of a random
variable deviates from the normal distribution.

The normal distribution is a probability distribution without any skewness. You can look at the image below, which shows a symmetrical distribution, basically a normal distribution, and you can see that it is symmetrical on both sides of the dashed line. Apart from this, there are two types of skewness:

Positive Skewness
Negative Skewness

The probability distribution with its tail on the right side is a positively skewed distribution and
the one with its tail on the left side is a negatively skewed distribution.

Normal Distribution: Normal Distribution is a probability distribution that is symmetric about the mean. It is also known as Gaussian Distribution. The distribution appears as a bell-shaped curve, which means the mean is the most frequent value in the given data set.

In Normal Distribution :

Mean = Median = Mode

Standard Normal Distribution: When a Normal Distribution has mean = 0 and Standard Deviation = 1, it is called the Standard Normal Distribution.
Normal Distributions are symmetrical in nature, but this does not imply that every symmetrical distribution is a Normal Distribution.
Normal Distribution is the probability distribution without any skewness.

Types of Skewness

Positive Skewness
Negative Skewness

Unlike the Normal Distribution (mean = median = mode), in positive as well as negative
skewness mean, median, and mode all are different.

Positive Skewness

In positive skewness, the extreme data values are larger, which in turn increases the mean value of the data set; in simple terms, a positively skewed distribution is a distribution with its tail on the right side.

In Positive Skewness:

Mean > Median > Mode

Negative Skewness

In negative skewness, the extreme data values are smaller, which decreases the mean value of the dataset; a negatively skewed distribution is a distribution with its tail on the left side.

In Negative Skewness:

Mean < Median < Mode

Calculate the skewness coefficient of the sample

First coefficient of skewness (Pearson's mode skewness coefficient)

Subtract the mode from the mean, then divide the difference by the standard deviation.

Second coefficient of skewness (Pearson's median skewness coefficient)

Subtract the median from the mean, multiply the difference by 3, and divide the product by the standard deviation.

If the skewness is between -0.5 and 0.5, the data are nearly symmetrical.

If the skewness is between -1 and -0.5 (negatively skewed) or between 0.5 and 1 (positively skewed), the data are slightly skewed.

If the skewness is lower than -1 (negatively skewed) or greater than 1 (positively skewed), the data are extremely skewed.
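Both coefficients can be sketched in a few lines of Python (the sample data below is invented for illustration):

```python
import statistics

def pearson_skewness(data):
    """Pearson's first (mode) and second (median) skewness coefficients."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    first = (mean - statistics.mode(data)) / s         # (mean - mode) / sd
    second = 3 * (mean - statistics.median(data)) / s  # 3 * (mean - median) / sd
    return first, second

# A sample with a tail on the right (positively skewed), invented for illustration
data = [1, 2, 2, 2, 3, 3, 4, 5, 9]
first, second = pearson_skewness(data)
print(round(first, 2), round(second, 2))   # both positive -> tail on the right
```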

Kurtosis:

Kurtosis measures the "heaviness of the tails" of a distribution (compared to a normal distribution). Kurtosis is positive if the tails are "heavier" than for a normal distribution, and negative if the tails are "lighter" than for a normal distribution. On this excess-kurtosis scale, the normal distribution has a kurtosis of zero.
Kurtosis characterizes the shape of a distribution - that is, its value does not depend on an arbitrary change of the scale and location of the distribution. For example, the kurtosis of a sample (or population) of temperature values in Fahrenheit will not change if you transform the values to Celsius (the mean and the variance will, however, change).

The kurtosis of a distribution or sample is equal to the 4th central moment divided by the 4th
power of the standard deviation, minus 3.
To calculate the kurtosis of a sample:

i) subtract the mean from each value to get a set of deviations from the mean;

ii) divide each deviation by the standard deviation of all the deviations;
iii) average the 4th powers of the deviations and subtract 3 from the result.
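The three steps above can be written out directly; this sketch uses population-style moments and an invented sample:

```python
def excess_kurtosis(data):
    n = len(data)
    mean = sum(data) / n
    deviations = [x - mean for x in data]               # step i
    sd = (sum(d * d for d in deviations) / n) ** 0.5    # population sd
    z = [d / sd for d in deviations]                    # step ii
    return sum(t ** 4 for t in z) / n - 3               # step iii

# Sample with most mass in the center and two far-out values (invented)
print(round(excess_kurtosis([1, 5, 5, 5, 5, 5, 9]), 2))   # -> 0.5
```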

Excess Kurtosis

The excess kurtosis is used in statistics and probability theory to compare the kurtosis coefficient with that of the normal distribution. Excess kurtosis can be positive (leptokurtic distribution), negative (platykurtic distribution), or near zero (mesokurtic distribution). Since normal distributions have a kurtosis of 3, excess kurtosis is calculated by subtracting 3 from the kurtosis.

Excess kurtosis = Kurt - 3

Types of excess kurtosis

1. Leptokurtic or heavy-tailed distribution (kurtosis more than normal distribution).


2. Mesokurtic (kurtosis same as the normal distribution).
3. Platykurtic or short-tailed distribution (kurtosis less than normal distribution).

Leptokurtic (kurtosis > 3)

A leptokurtic distribution has very long and skinny tails, which means there are more chances of outliers. Positive values of excess kurtosis indicate that the distribution is peaked and possesses thick tails. An extreme positive kurtosis indicates a distribution where more of the numbers are located in the tails of the distribution instead of around the mean.

Platykurtic (kurtosis < 3)

A platykurtic distribution has thinner tails and is stretched around the center, which means most of the data points lie in close proximity to the mean. A platykurtic distribution is flatter (less peaked) when compared with the normal distribution.

Mesokurtic (kurtosis = 3)

A mesokurtic distribution is the same as the normal distribution, which means its excess kurtosis is near zero. Mesokurtic distributions are moderate in breadth, and their curves have a medium peaked height.

Box plot:

A box plot, also known as a Five Number Summary, summarizes data using the median, upper quartile, lower quartile, and the minimum and maximum values. It allows you to see important characteristics of the data at a glance (visually). It also helps us to visualize outliers in the data set.

A box plot, or Five Number Summary, provides the following five pieces of information:

1. Median

2. Lower Quartile(25th Percentile)

3. Upper Quartile(75th Percentile)

4. Minimum Value

5. Maximum Value

Working Example of Box Plot

Let us understand the box plot with an example.

Step 1: Take the set of numbers given below.

14, 19, 100, 27, 54, 52, 93, 50, 61, 87,68, 85, 75, 82, 95

Arrange the data in increasing(ascending) order

14, 19, 27, 50, 52, 54, 61, 68, 75, 82, 85, 87, 93, 95, 100

Step 2: Find the median of this data set. The median is the middle value of this ordered data set.

14, 19, 27, 50, 52, 54, 61, 68, 75, 82, 85, 87, 93, 95, 100

Here it is 68.

Step 3: Let us find the Lower Quartile.

The Lower Quartile is the median of the values to the left of the median found in Step 2 (i.e., 68).

(14, 19, 27, 50, 52, 54, 61), 68, 75, 82, 85, 87, 93, 95, 100

Lower Quartile is 50

Step 4: Let us find the Upper Quartile.

The Upper Quartile is the median of the values to the right of the median found in Step 2 (i.e., 68).

14, 19, 27, 50, 52, 54, 61, 68,( 75, 82, 85, 87, 93, 95, 100)

Upper Quartile is 87

Step 5: Let us find the Minimum Value.

It is the value at the extreme left of this data set, i.e., the first value in the data set after ordering.

14, 19, 27, 50, 52, 54, 61, 68, 75, 82, 85, 87, 93, 95, 100

Minimum Value is 14

Step 6: Let us find the Maximum Value.

It is the value at the extreme right of this data set, i.e., the last value in the data set after ordering.

14, 19, 27, 50, 52, 54, 61, 68, 75, 82, 85, 87, 93, 95, 100

Maximum Value is 100

Range:

Range is basically the spread of our data set. The range is the difference between the Maximum and Minimum values. Here, the range is 100 - 14 = 86.

Interquartile Range(IQR):

The Interquartile Range (IQR) is the difference between the Upper Quartile and the Lower Quartile.

As per the example above, our Lower Quartile is 50 and our Upper Quartile is 87, so the IQR is 87 - 50 = 37.
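Steps 1 to 6, the range, and the IQR can all be scripted with the same median-of-halves method used above (a sketch, reusing the example numbers):

```python
def five_number_summary(values):
    data = sorted(values)                  # Step 1: ascending order
    n = len(data)

    def median(xs):
        m = len(xs) // 2
        return xs[m] if len(xs) % 2 else (xs[m - 1] + xs[m]) / 2

    med = median(data)                     # Step 2: overall median
    lower_q = median(data[:n // 2])        # Step 3: median of the left half
    upper_q = median(data[(n + 1) // 2:])  # Step 4: median of the right half
    return data[0], lower_q, med, upper_q, data[-1]   # Steps 5 and 6

nums = [14, 19, 100, 27, 54, 52, 93, 50, 61, 87, 68, 85, 75, 82, 95]
mn, q1, med, q3, mx = five_number_summary(nums)
print(mn, q1, med, q3, mx)                  # -> 14 50 68 87 100
print("Range:", mx - mn, "IQR:", q3 - q1)   # -> Range: 86 IQR: 37
```

The helper also handles an even number of records, averaging the two middle values exactly as described in the next section.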

Box plot with an even number of data points:

Step 1 :

We have 14 records below.

14, 19, 100, 27, 54, 52, 93, 50, 61, 87,68, 85, 75, 82

Arrange the data in increasing(ascending) order

14, 19, 27, 50, 52, 54, 61, 68, 75, 82, 85, 87, 93, 100

Step 2: Since we have an even number of records, take the middle two values, add them, and divide the sum by 2.

Here the values at positions 7 and 8 (61 and 68) are the middle values.

So our new median value is (61 + 68) / 2 = 64.5

Continue Steps 3 to 6 to get the values mentioned in the Working Example of Box Plot section. The final result is as below.

Pivot table:

A pivot table is a summary tool that wraps up or summarizes information sourced from bigger tables. These bigger tables could be a database, an Excel spreadsheet, or any data that is or could be converted into a table-like form. The data summarized in a pivot table might include sums, averages, or other statistics, which the pivot table groups together in a meaningful way.

Wide vs. Long Data


Representing potentially multi-dimensional data in a single 2-dimensional table may require some compromises. Either some data will be repeated (long data format) or your data set may require blank cells (wide data format).

Examples of pivot table:

Pivoting in Python with Pandas

Starting with the raw dataset loaded into df_long_data, we call pandas' pivot_table function. Note the key arguments here.

1. Index determines what will be unique in the leftmost column of the result.

2. Columns creates one output column for each unique value in the chosen column of the input table.

3. Values defines what to put in the cells of the output table.

4. aggfunc defines how to combine the values (commonly sum, but other aggregation functions are also used: min, max, mean, 95th percentile, mode). You can also define your own aggregation functions and pass them in.
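A minimal sketch in pandas, with invented column names and data (df_long_data matches the name used above):

```python
import pandas as pd

# Hypothetical long-format data: one row per (month, region) observation
df_long_data = pd.DataFrame({
    "month":  ["Jan", "Jan", "Feb", "Feb", "Jan", "Feb"],
    "region": ["North", "South", "North", "South", "North", "South"],
    "sales":  [100, 150, 120, 130, 80, 70],
})

# index: unique leftmost column; columns: one output column per region;
# values: what fills the cells; aggfunc: how repeated rows are combined
wide = pd.pivot_table(df_long_data, index="month", columns="region",
                      values="sales", aggfunc="sum")
print(wide)
```

Each repeated (month, region) pair in the long table collapses into a single summed cell of the wide table.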

Pivoting in Google Sheets

Pivoting in Google Sheets does the exact same thing as our Python code above, but in a graphical way. Instead of specifying arguments to the pivot_table function, we select the equivalent options from a menu.

Grouping in PostgreSQL

Grouping with GROUP BY approximates the desired results, but it does require some manual intervention to define the possible groups.

Aggregation Functions

Sum is often used to combine data in pivot tables, but pivot tables are much more flexible than just simple sums. Different tools will provide their own selection of aggregation functions. For example, Pandas provides min, max, first, last, unique, std (standard deviation), var (variance), count, and quantile, among others. We can also define our own aggregation functions and pass multiple different aggregation functions to the same column or different columns in the same pivot table.

Missing Data

As with any aggregation, missing data must be dealt with. In the example data used here, there are numerous months with a value of 0, which means there were no sales in those months, not that there were missing (null) values. We need to accept (and be aware of) the default behaviour for these cases or specify what to do.

Heat map:

Heatmaps visualize the data in a 2-dimensional format in the form of colored maps. The color
maps use hue, saturation, or luminance to achieve color variation to display various details. This
color variation gives visual cues to the readers about the magnitude of numeric values.

Heatmaps can describe the density or intensity of variables, visualize patterns, variance, and even
anomalies. Heatmaps show relationships between variables. These variables are plotted on both
axes.

Uses of HeatMap

Business Analytics: A heat map is used as a visual business analytics tool. A heat map gives quick visual cues about the current results, performance, and scope for improvements. Heatmaps can analyze the existing data and find areas of intensity that might reflect where most customers reside, areas of risk of market saturation, or cold sites and sites that need a boost. Heat maps can be continually updated to reflect growth and efforts.

Website: Heatmap visualization helps business owners and marketers to identify the best and worst-performing sections of a webpage. These insights help them with optimization.
Exploratory Data Analysis: EDA is a task performed by data scientists to get familiar
with the data. All the initial studies done to understand the data are known as EDA.
Exploratory Data Analysis (EDA) is the process of analyzing datasets before the
modeling task. It is a tedious task to look at a spreadsheet filled with numbers and
determine essential characteristics of a dataset.

Molecular Biology: Heat maps are used to study disparity and similarity patterns in DNA,
RNA, etc.

Geovisualization: Geospatial heatmap charts are useful for displaying how geographical
areas of a map are compared to one another based on specific criteria. Heatmaps help in
cluster analysis or hotspot analysis to detect clusters of high concentrations of activity;
For example, Airbnb rental price analysis.

Marketing and Sales: Heat mapping to detect warm and cold spots is used to improve marketing response rates through targeted marketing. Heatmaps allow the detection of areas that respond to campaigns, under-served markets, customer residence, and high sale trends, which helps optimize product lineups, capitalize on sales, create targeted customer segments, and assess regional demographics.

Types of HeatMaps

Typically, there are two types of Heatmaps:

Grid Heatmap: The magnitudes of values shown through colors are laid out into a matrix of
rows and columns, mostly by a density-based function. Below are the types of Grid
Heatmaps.

Clustered Heatmap: The goal of Clustered Heatmap is to build associations between both
the data points and their features. This type of heatmap implements clustering as part of
the process of grouping similar features. Clustered Heatmaps are widely used in
biological sciences for studying gene similarities across individuals.

The order of the rows in Clustered Heatmap is determined by performing hierarchical cluster
analysis of the rows. Clustering positions similar rows together on the map. Similarly, the order
of the columns is determined.
Correlogram: A correlogram replaces each of the variables on the two axes with numeric
variables in the dataset. Each square depicts the relationship between the two intersecting
variables, which helps to build descriptive or predictive statistical models.
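A correlogram can be sketched with pandas and matplotlib (the dataset below is invented; both other columns are constructed to rise linearly with Duration):

```python
import matplotlib
matplotlib.use('Agg')            # draw without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'Duration':        [30, 45, 60, 90, 120],
    'Average_Pulse':   [80, 85, 90, 100, 110],
    'Calorie_Burnage': [120, 180, 240, 360, 480],
})

corr = df.corr()                 # pairwise correlation coefficients in [-1, 1]

# Render the correlation matrix as a colored grid
plt.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
plt.xticks(range(len(corr)), corr.columns, rotation=45)
plt.yticks(range(len(corr)), corr.columns)
plt.colorbar(label='correlation')
plt.savefig('correlogram.png')
```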

Spatial Heatmap: Each square in the heatmap is assigned a color according to the nearby value, and the location of the color corresponds to the magnitude of the value in that particular space. Cells with higher values than other cells are given a hot color, while cells with lower values are assigned a cold color.

Correlation statistics:
Correlation

Correlation measures the relationship between two variables.

We mentioned that a function has the purpose of predicting a value by converting an input (x) to an output (f(x)). We can also say that a function uses the relationship between two variables for prediction.

Correlation Coefficient

The correlation coefficient measures the relationship between two variables.

The correlation coefficient can never be less than -1 or higher than 1.

1 = there is a perfect positive linear relationship between the variables (like Average_Pulse against Calorie_Burnage)

0 = there is no linear relationship between the variables

-1 = there is a perfect negative linear relationship between the variables (e.g. fewer hours worked leads to higher calorie burnage during a training session)

Example of a Perfect Linear Relationship (Correlation Coefficient = 1)

We will use a scatterplot to visualize the relationship between Average_Pulse and Calorie_Burnage (we have used the small data set of the sports watch with 10 observations). This time we want scatter plots, so we change kind to "scatter":

#Three lines to make our compiler able to draw:
import sys
import matplotlib
matplotlib.use('Agg')

import pandas as pd
import matplotlib.pyplot as plt

health_data = pd.read_csv("data.csv", header=0, sep=",")

health_data.plot(x='Average_Pulse', y='Calorie_Burnage', kind='scatter')

plt.show()

#Two lines to make our compiler able to draw:


plt.savefig(sys.stdout.buffer)
sys.stdout.flush()

output:

Example of a Perfect Negative Linear Relationship (Correlation Coefficient = -1)

#Three lines to make our compiler able to draw:
import sys
import matplotlib
matplotlib.use('Agg')

import pandas as pd
import matplotlib.pyplot as plt

negative_corr = {'Hours_Work_Before_Training': [10,9,8,7,6,5,4,3,2,1],
'Calorie_Burnage': [220,240,260,280,300,320,340,360,380,400]}

negative_corr = pd.DataFrame(data=negative_corr)

negative_corr.plot(x='Hours_Work_Before_Training', y='Calorie_Burnage', kind='scatter')

plt.show()

#Two lines to make our compiler able to draw:
plt.savefig(sys.stdout.buffer)
sys.stdout.flush()

output:

We have plotted fictional data here. The x-axis represents the amount of hours worked at our job
before a training session. The y-axis is Calorie_Burnage.

If we work longer hours, we tend to have lower calorie burnage because we are exhausted before
the training session.

The correlation coefficient here is -1.
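The coefficient for this fictional dataset can be verified numerically with pandas' corr method:

```python
import pandas as pd

negative_corr = pd.DataFrame({
    'Hours_Work_Before_Training': [10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
    'Calorie_Burnage': [220, 240, 260, 280, 300, 320, 340, 360, 380, 400],
})

# Pearson correlation coefficient between the two columns
r = negative_corr['Hours_Work_Before_Training'].corr(negative_corr['Calorie_Burnage'])
print(round(r, 2))   # -> -1.0, a perfect negative linear relationship
```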

Example of No Linear Relationship (Correlation coefficient = 0)

#Three lines to make our compiler able to draw:

import sys

import matplotlib

matplotlib.use('Agg')

import pandas as pd

import matplotlib.pyplot as plt

full_health_data = pd.read_csv("data.csv", header=0, sep=",")

full_health_data.plot(x ='Duration', y='Max_Pulse', kind='scatter')

plt.show()

#Two lines to make our compiler able to draw:

plt.savefig(sys.stdout.buffer)

sys.stdout.flush()

output:

ANOVA:

ANOVA stands for analysis of variance and, as the name suggests, it helps us understand and compare variances among groups. Before going into detail about ANOVA, recall a few terms from statistics:

Mean: The average of all values.

Variance: A measure of the variation among values. It is calculated by adding up the squared differences between each value and the mean, and then dividing the sum by the number of samples.

Standard deviation: The square root of variance.

In order to understand the motivation behind ANOVA, or some other statistical tests, we should
learn two simple terms: population and sample.

Population is all elements in a group. For example,

College students in the US is a population that includes all of the college students in the US.

25-year-old people in Europe is a population that includes all of the people that fit the description.

It is not always feasible or possible to do analysis on population because we cannot collect all
the data of a population. Therefore, we use samples.

Sample is a subset of a population. For example,

1000 college students in the US is a sample of the population of college students in the US.

When we compare two samples (or groups), we can use a t-test to see if there is any difference in the means of the groups. When we have more than two groups, the t-test is not the optimal choice because we need to apply it to each pair separately. Consider groups A, B, and C. To compare the means, we need to apply a t-test to A-B, A-C, and B-C. As the number of groups increases, this becomes harder to manage.

In the case of comparing three or more groups, ANOVA is preferred. There are two elements of
ANOVA:

Variation within each group

Variation between groups

ANOVA test result is based on F ratio which is the ratio of the variation between groups to the
variation within groups.

F ratio shows how much of the total variation comes from the variation between groups and how
much comes from the variation within groups.
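The F ratio can be computed by hand in a short sketch (the three groups below are invented for illustration):

```python
def f_ratio(*groups):
    """One-way ANOVA F ratio: between-group / within-group mean squares."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)

    k = len(groups)          # number of groups
    n = len(all_values)      # total observations

    # Variation between groups (sum of squares between, k-1 degrees of freedom)
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    msb = ssb / (k - 1)

    # Variation within groups (sum of squares within, n-k degrees of freedom)
    ssw = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    msw = ssw / (n - k)

    return msb / msw

A = [4, 5, 6]
B = [5, 6, 7]
C = [9, 10, 11]
print(round(f_ratio(A, B, C), 2))   # -> 21.0
```

A large F ratio, as here, means the variation between the group means dwarfs the variation within each group.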
