Computational Data Science - Unit 4

This document discusses hypothesis testing, its types, and statistical tests used in data analysis, emphasizing the importance of forming null and alternative hypotheses. It outlines the steps involved in hypothesis testing, including data collection, conducting statistical tests, and interpreting p-values. The document also covers normal distribution, parametric and non-parametric tests, and their respective assumptions and applications.


COMPUTATIONAL DATA SCIENCE (ITITE68)

UNIT – 4

Statistical test: Hypothesis testing and its types, p-values, and confidence intervals,
statistical assumptions, normal distribution, parametric test, non-normal distribution
and its types (Beta Distribution, Exponential Distribution, Gamma Distribution, Log-
Normal Distribution, Logistic Distribution, Maxwell-Boltzmann Distribution, Poisson
Distribution), non-parametric tests, choosing the right statistical test

In today’s data-driven world, decisions are based on data all the time. Hypotheses play a crucial role in that process, whether in business decisions, the health sector, academia, or quality improvement. Without hypotheses and hypothesis tests, you risk drawing the wrong conclusions and making bad decisions.
➢ Hypothesis Testing is a type of statistical analysis in which you put your
assumptions about a population parameter to the test. It is used to assess the
relationship between two statistical variables.

Let's discuss a few examples of statistical hypotheses from real life:

• A teacher assumes that 60% of his college's students come from lower-middle-class
families.

• A doctor believes that 3D (Diet, Dose, and Discipline) is 90% effective for diabetic
patients.

How Does Hypothesis Testing Work?

• An analyst performs hypothesis testing on a statistical sample to present evidence of
the plausibility of the null hypothesis. Measurements and analyses are conducted on a
random sample of the population to test a theory. Analysts use a random population
sample to test two hypotheses: the null and alternative hypotheses.

• The null hypothesis is typically an equality hypothesis between population parameters;
for example, a null hypothesis may claim that the population mean return equals zero.
The alternative hypothesis is essentially the inverse of the null hypothesis (e.g., the
population mean return is not equal to zero). As a result, they are mutually
exclusive, and only one can be correct. One of the two possibilities, however, will
always be correct.

Null Hypothesis and Alternate Hypothesis


➢ The Null Hypothesis is the assumption that the event will not occur. A null hypothesis
has no bearing on the study's outcome unless it is rejected.

H0 is the symbol for it, and it is pronounced H-naught.


➢ The Alternate Hypothesis is the logical opposite of the null hypothesis. The acceptance
of the alternative hypothesis follows the rejection of the null hypothesis. H1 is the
symbol for it.

Let's understand this with an example.

A sanitizer manufacturer claims that its product kills 95 percent of germs on average.

To put this company's claim to the test, create a null and alternate hypothesis.

H0 (Null Hypothesis): Average = 95%.

Alternative Hypothesis (H1): The average is less than 95%.

➢ Another straightforward example of this concept is determining whether or not a coin
is fair and balanced. The null hypothesis states that the probability of heads is equal to
the probability of tails. In contrast, the alternative hypothesis states that the
probabilities of heads and of tails would be very different.

Steps of Hypothesis Testing

Step 1: Specify Your Null and Alternate Hypotheses

It is critical to rephrase your original research hypothesis (the prediction that you wish to study)
as a null (H0) and alternative (H1) hypothesis so that you can test it quantitatively. Your initial
hypothesis, which predicts a link between variables, is generally your alternative hypothesis. The
null hypothesis predicts no link between the variables of interest.

Step 2: Gather Data

For a statistical test to be legitimate, sampling and data collection must be done in a way that
is meant to test your hypothesis. You cannot draw statistical conclusions about the population
you are interested in if your data is not representative.

Step 3: Conduct a Statistical Test

A variety of statistical tests are available, but they all compare within-group variance (how
spread out the data are within a category) against between-group variance (how different the
categories are from one another). If the between-group variance is large enough that there is
little or no overlap between groups, your statistical test will return a low p-value. This
suggests that the differences between these groups are unlikely to have occurred by chance.
Alternatively, if there is large within-group variance and low between-group variance, your
statistical test will return a high p-value: any difference you find across groups is most likely
attributable to chance. The number of variables and the level of measurement of your collected
data will influence your statistical test selection.

Step 4: Determine Rejection of Your Null Hypothesis

Your statistical test results must determine whether your null hypothesis should be rejected or
not. In most circumstances, you will base your judgment on the p-value provided by the
statistical test. In most circumstances, your preset level of significance for rejecting the null
hypothesis will be 0.05 - that is, when there is less than a 5% likelihood that these data would
be seen if the null hypothesis were true. In other circumstances, researchers use a lower level
of significance, such as 0.01 (1%). This reduces the possibility of wrongly rejecting the null
hypothesis.

Step 5: Present Your Results

The findings of hypothesis testing are discussed in the results and discussion sections of
your research paper, dissertation, or thesis. The results section should include a concise
overview of the data and a summary of the findings of your statistical test. In the discussion
section you can consider whether your results confirmed your initial hypothesis. "Rejecting"
or "failing to reject" the null hypothesis is the formal phrasing used in hypothesis testing,
and it is likely required in your statistics assignments.
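As a rough sketch, the steps above can be wired together in Python (NumPy and SciPy assumed available; the two groups below are simulated, purely illustrative data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Step 1: H0: the two group means are equal; H1: they differ.
# Step 2: gather data (simulated here for illustration).
group_a = rng.normal(loc=50, scale=5, size=40)
group_b = rng.normal(loc=53, scale=5, size=40)

# Step 3: conduct a statistical test (two-sample t-test).
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Step 4: compare the p-value with the preset significance level.
alpha = 0.05
reject_h0 = p_value < alpha
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, reject H0: {reject_h0}")
```

Step 5 would then report the statistic, the p-value, and whether H0 was rejected.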
Types of Hypothesis Testing

Z Test

Hypothesis testing uses a z-test to determine whether a discovery or relationship is
statistically significant. It usually checks whether two means are the same (the null
hypothesis). A z-test can be applied only when the population standard deviation is known
and the sample size is 30 data points or more.
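A minimal one-sample z-test can be written with the standard library alone; the sample figures below are made up for illustration:

```python
import math

def one_sample_z(sample_mean, pop_mean, pop_sigma, n):
    """Return the z statistic and two-sided p-value, using the normal CDF via erf."""
    z = (sample_mean - pop_mean) / (pop_sigma / math.sqrt(n))
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # standard normal CDF at |z|
    return z, 2 * (1 - phi)

z, p = one_sample_z(sample_mean=52, pop_mean=50, pop_sigma=6, n=36)
print(round(z, 2), round(p, 4))  # z = 2.0, p ≈ 0.0455
```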

T Test

A t-test is a statistical test used to compare the means of two groups. It is frequently used
in hypothesis testing to determine whether two groups differ or whether a procedure or
treatment affects the population of interest.
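For example, an unpaired two-sample t-test with SciPy's `ttest_ind` (the measurements are hypothetical):

```python
from scipy import stats

# Hypothetical outcome measurements for a treated and a control group.
treated = [23.1, 21.8, 24.5, 22.9, 23.7, 24.1]
control = [20.4, 21.0, 19.8, 20.9, 21.3, 20.1]

t_stat, p_value = stats.ttest_ind(treated, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # small p: the group means differ
```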

Chi-Square

You use a Chi-square test for hypothesis testing about whether your data are as
predicted. To determine whether the expected and observed results are well fitted, the Chi-square
test analyzes the differences between categorical variables from a random sample. The test's
fundamental premise is that the observed values in your data are compared to the values
that would be expected if the null hypothesis were true.
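A goodness-of-fit sketch with SciPy's `chisquare`, using made-up counts from 100 rolls of a die:

```python
from scipy import stats

observed = [18, 22, 16, 14, 12, 18]  # observed counts for faces 1..6
expected = [100 / 6] * 6             # expected counts under H0: the die is fair

chi2, p_value = stats.chisquare(observed, f_exp=expected)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")  # large p: no evidence of unfairness
```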
Hypothesis Testing and Confidence Intervals

• Both confidence intervals and hypothesis tests are inferential techniques that depend on
approximating the sample distribution.

• Data from a sample are used to estimate a population parameter via confidence
intervals. Data from a sample are used in hypothesis testing to examine a given
hypothesis. We must have a postulated parameter to conduct hypothesis testing.

• Bootstrap distributions and randomization distributions are created using comparable


simulation techniques. The observed sample statistic is the focal point of a bootstrap
distribution, whereas the null hypothesis value is the focal point of a randomization
distribution.

• A range of plausible population parameter estimates is contained in a confidence
interval. There is a direct connection between two-tailed confidence intervals and
two-tailed hypothesis tests: a two-tailed hypothesis test and the corresponding
two-tailed confidence interval typically lead to the same conclusion.

• In other words, a hypothesis test at the 0.05 level will virtually always fail to reject the
null hypothesis if the 95% confidence interval contains the predicted value. A
hypothesis test at the 0.05 level will nearly certainly reject the null hypothesis if the
95% confidence interval does not include the hypothesized parameter.
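This duality can be checked numerically; the figures below are illustrative (known σ, so a z-based interval applies):

```python
import math

sample_mean, pop_sigma, n = 10.4, 2.0, 100
hypothesized = 10.0

se = pop_sigma / math.sqrt(n)                        # standard error = 0.2
ci_low, ci_high = sample_mean - 1.96 * se, sample_mean + 1.96 * se
z = (sample_mean - hypothesized) / se                # z statistic

inside_ci = ci_low <= hypothesized <= ci_high
reject_h0 = abs(z) > 1.96
print(f"95% CI = ({ci_low:.2f}, {ci_high:.2f}); H0 value inside: {inside_ci}; reject H0: {reject_h0}")
```

Rejecting H0 at the 0.05 level coincides with the hypothesized value falling outside the 95% interval.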

P-Value

➢ A p-value is a metric that expresses the likelihood that an observed difference could
have occurred by chance. As the p-value decreases the statistical significance of the
observed difference increases. If the p-value is too low, you reject the null hypothesis.

➢ Consider an example in which you are testing whether a new advertising campaign has
increased a product's sales. The null hypothesis states that there is no change in sales
due to the new campaign. The p-value is the probability of observing data at least as
extreme as yours if that null hypothesis were true. If the p-value is 0.30, there is a
30% chance of seeing such a difference in sales even with no real effect; if the p-value
is 0.03, there is only a 3% chance. As you can see, the lower the p-value, the stronger
the evidence against the null hypothesis, i.e., the stronger the evidence that the new
advertising campaign changed sales.

Normal Distribution
• In probability theory and statistics, the Normal Distribution, also called the Gaussian
Distribution, is the most significant continuous probability distribution. Sometimes it
is also called a bell curve.
• A large number of random variables, across the physical sciences and economics, are
either nearly or exactly represented by the normal distribution. Furthermore, it can
be used to approximate other probability distributions, supporting the use of the
word ‘normal’, as in the distribution most often encountered.

Normal Distribution Definition


The Normal Distribution is defined by the probability density function of a continuous random
variable. Let f(x) be the probability density function and X the random variable. Integrating
f(x) over an interval (x, x + dx) gives the probability that the random variable X takes a value
between x and x + dx. The density must satisfy:
f(x) ≥ 0 for all x ∈ (−∞, +∞)
and ∫ from −∞ to +∞ of f(x) dx = 1

Normal Distribution Formula


The probability density function of the normal or Gaussian distribution is given by:

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
Where,

• x is the variable
• μ is the mean
• σ is the standard deviation

Normal Distribution Curve


• Random variables following the normal distribution are continuous: they can take any
value in a given range.
• For example, consider the heights of students in a school. Here the observed values are
bounded in a range, say 0 to 6 ft; this limitation is imposed physically by our query.
• The normal distribution itself, however, places no such bound: the range extends from
−∞ to +∞ and still gives a smooth curve. Such random variables are called Continuous
Variables, and the Normal Distribution then provides the probability of the value lying
in a particular range for a given experiment.
Normal Distribution Standard Deviation
• Generally, the normal distribution can have any positive standard deviation. The mean
determines the line of symmetry of the graph, whereas the standard deviation
determines how far the data are spread out.
• If the standard deviation is smaller, the data are somewhat close to each other and the
graph becomes narrower.
• If the standard deviation is larger, the data are dispersed more, and the graph becomes
wider. The standard deviations are used to subdivide the area under the normal curve.
Each subdivided section defines the percentage of data, which falls into the specific
region of a graph.
In terms of standard deviations, the Empirical Rule states that:

• Approximately 68% of the data falls within one standard deviation of the mean (i.e.,
between mean − one standard deviation and mean + one standard deviation).
• Approximately 95% of the data falls within two standard deviations of the mean (i.e.,
between mean − two standard deviations and mean + two standard deviations).
• Approximately 99.7% of the data falls within three standard deviations of the mean
(i.e., between mean − three standard deviations and mean + three standard deviations).

Thus, the empirical rule is also called the 68 – 95 – 99.7 rule.
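The rule can be checked against the exact standard normal CDF (SciPy assumed available):

```python
from scipy.stats import norm

for k, rule in [(1, 0.68), (2, 0.95), (3, 0.997)]:
    coverage = norm.cdf(k) - norm.cdf(-k)  # P(mean − k·sd < X < mean + k·sd)
    print(f"within {k} sd: {coverage:.4f} (rule of thumb: {rule})")
```

The exact values are 0.6827, 0.9545 and 0.9973, so the 68 – 95 – 99.7 shorthand is a close approximation.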

Normal Distribution Properties


Some of the important properties of the normal distribution are listed below:

• In a normal distribution, the mean, median and mode are equal (i.e., Mean = Median=
Mode).
• The total area under the curve should be equal to 1.
• The normally distributed curve should be symmetric at the centre.
• Exactly half of the values are to the right of the centre and exactly half are to the
left of the centre.
• The normal distribution should be defined by the mean and standard deviation.
• The normal distribution curve must have only one peak. (i.e., Unimodal)
• The curve approaches the x-axis as it extends farther from the mean, but it never
touches it.

Applications
The normal distributions are closely associated with many things such as:

• Marks scored on the test


• Heights of different persons
• Size of objects produced by the machine
• Blood pressure and so on.

Parametric and non-parametric tests

• Statistical tests are statistical methods that help us reject or not reject our null
hypothesis. They’re based on probability distributions and can be one-tailed or two-
tailed, depending on the hypotheses that we’ve chosen.

There are other ways in which statistical tests can differ and one of them is based on their
assumptions of the probability distribution that the data in question follows.

• Parametric tests are those statistical tests that assume the data approximately
follows a normal distribution, amongst other assumptions (examples include z-
test, t-test, ANOVA). Important note — the assumption is that the data of the whole
population follows a normal distribution, not the sample data that you’re working
with.

• Nonparametric tests are those statistical tests that don’t assume anything about
the distribution followed by the data, and hence are also known as distribution
free tests (examples include Chi-square, Mann-Whitney U). Nonparametric tests
are based on the ranks held by different data points.

• Common parametric tests are focused on analyzing and comparing the mean or
variance of data.

• The mean is the most commonly used measure of central tendency to describe data,
however it is also heavily impacted by outliers. Thus, it is important to analyze your
data and determine whether the mean is the best way to represent it. If yes, then
parametric tests are the way to go! If not, and the median better represents your
data, then nonparametric tests might be the better option.
As mentioned above, parametric tests have a couple of assumptions that need to be met by the
data:

1. Normality — the sample data come from a population that approximately follows
a normal distribution

2. Homogeneity of variance — the sample data come from a population with the
same variance

3. Independence — the sample data consist of independent observations and are
sampled randomly

4. Outliers — the sample data don’t contain any extreme outliers
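These assumptions can be screened with standard tests, for example Shapiro-Wilk for normality and Levene for homogeneity of variance (a sketch with simulated data, SciPy assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.normal(0, 1, 50)  # simulated sample A
b = rng.normal(0, 1, 50)  # simulated sample B

_, p_normal_a = stats.shapiro(a)     # H0: the sample comes from a normal population
_, p_equal_var = stats.levene(a, b)  # H0: the samples have equal variances

print(f"normality p = {p_normal_a:.3f}, equal-variance p = {p_equal_var:.3f}")
```

Large p-values mean the assumptions are not rejected, not that they are proven true.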

Advantages and Disadvantages of Parametric and Nonparametric Tests


Many people assume that the choice between parametric and nonparametric tests depends on
whether the data are normally distributed. The distribution can act as a deciding factor when
the data set is relatively small. In many cases, though, this issue is not critical, for the
following reasons:
• Parametric tests can analyze non-normal distributions for many large datasets.
• Nonparametric tests have other firm assumptions of their own that can be harder to
meet.
The appropriate response is usually dependent upon whether the mean or median is chosen to
be a better measure of central tendency for the distribution of the data.
• A parametric test is considered when you have the mean value as your central value
and the size of your data set is comparatively large. This test helps in making powerful
and effective decisions.
• A non-parametric test is considered regardless of the size of the data set if the median
value is better when compared to the mean value.
Ultimately, if your sample size is small, you may be compelled to use a nonparametric test.

Classification Of Parametric Test and Non-Parametric Test


There are different kinds of parametric tests and non-parametric tests to check the data.
Types Of Parametric Test
• Student's T-Test: - This test is used when the samples are small and the population
variances are unknown. It is used to compare the means and proportions of two small
independent samples, or a sample mean against a population mean.
• 1 Sample T-Test: - This test compares the mean of a single group of observations
against a specified value.
• Unpaired 2 Sample T-Test: - The test is performed to compare the two means of two
independent samples. These samples came from the normal populations having the
same or unknown variances.
• Paired 2 Sample T-Test: - In the case of paired data of observations from a single
sample, the paired 2 sample t-test is used.
• ANOVA: - Analysis of variance is used when the difference in the mean values of more
than two groups is given.
• One Way ANOVA: - This test is useful when different testing groups differ by only
one factor.
• Two Way ANOVA: - When various testing groups differ by two or more factors, then
a two-way ANOVA test is used.
• Pearson's Correlation Coefficient: - This coefficient is the estimation of the strength
between two variables. The test is used in finding the relationship between two
continuous and quantitative variables.
• Z - Test: - The test helps measure the difference between two means.
• Z - Proportionality Test: - It is used in calculating the difference between two
proportions.

Types Of Non-Parametric Test


• 1 Sample Sign Test: - In this test, the median of a population is calculated and is
compared to the target value or reference value.
• 1 Sample Wilcoxon Signed Rank Test: - This test also compares the population median
with a target value, but the data are assumed to come from a symmetric distribution.
• Friedman Test: - This test compares groups on an ordinal (or non-normal continuous)
dependent variable, using ranks.
• Goodman and Kruskal's Gamma: - It is a rank-based measure of association for ranked
(ordinal) variables.
• Kruskal-Wallis Test: - This test is used when two or more medians are different. For
the calculations in this test, ranks of the data points are used.
• The Mann-Kendall Trend Test: - The test helps in finding the trends in time-series
data.
• Mann-Whitney Test: - To compare differences between two independent groups, this
test is used. The condition used in this test is that the dependent values must be
continuous or ordinal.
• Mood's Median Test: - This test is used when there are two independent samples.
• Spearman Rank Correlation: - This technique is used to estimate the relation between
two sets of data.
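As an example, a Mann-Whitney U test on two small hypothetical groups (rank-based, so no normality assumption is needed):

```python
from scipy.stats import mannwhitneyu

group_a = [12, 15, 14, 10, 13]
group_b = [22, 25, 19, 24, 21]

u_stat, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.4f}")  # every value in B exceeds A, so U = 0
```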
Applications Of Parametric Tests
• This test is used when the given data is quantitative and continuous.

• When the data is of normal distribution then this test is used.


• The parametric tests are helpful when the data is estimated on the approximate ratio or
interval scales of measurement.

Applications Of Non-Parametric Tests


• These tests are used in the case of solid mixing to study the sampling results.

• The tests are helpful when the data is estimated with different kinds of measurement
scales.
• The non-parametric tests are used when the distribution of the population is unknown.

Non-Normal Distribution
• Although the normal distribution takes center stage in statistics, many processes follow
a non-normal distribution. This can be because the data naturally follow a specific
type of non-normal distribution (for example, bacterial growth naturally follows
an exponential distribution).
What is a Beta Distribution?

• A Beta distribution is a versatile way to represent outcomes for percentages or


proportions. For example, how likely is it that a rogue candidate will win the next
Presidential election? You might think the probability is 0.2. Your friend might think
it’s 0.15. The beta distribution gives you a way to describe this.

One reason this distribution is confusing is that there are three “Betas” to contend with, and
they all have different meanings:

• Beta (α, β): the name of the probability distribution.


• B (α, β): the name of a function in the denominator of the pdf. This acts as a
“normalizing constant” to ensure that the area under the curve of the pdf equals
1.
• β: the name of the second shape parameter in the pdf.

The basic beta distribution is also called the beta distribution of the first kind. Beta distribution
of the second kind is another name for the beta prime distribution.

Probability Density Function

The general formula for the probability density function is:

f(x) = x^(α−1) · (1 − x)^(β−1) / B(α, β), for 0 ≤ x ≤ 1

where:
α and β are two positive shape parameters which control the shape of the distribution.
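A quick numerical check that B(α, β) really normalizes the density (SciPy assumed; α = 2, β = 5 are arbitrary illustrative values):

```python
from scipy.stats import beta
from scipy.integrate import quad

a_shape, b_shape = 2, 5
area, _ = quad(lambda x: beta.pdf(x, a_shape, b_shape), 0, 1)  # should be 1
mean = beta.mean(a_shape, b_shape)  # α / (α + β) = 2/7
print(f"area = {area:.4f}, mean = {mean:.4f}")
```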

What is an Exponential distribution?

• The exponential distribution (also called the negative exponential distribution) is a


probability distribution that describes time between events in a Poisson process.
• There is a strong relationship between the Poisson distribution and the Exponential
distribution. For example, let’s say a Poisson distribution models the number of births
in a given time period. The time in between each birth can be modelled with an
exponential distribution (Young & Young, 1998).

What is a Poisson Process?

• The Poisson process gives you a way to find probabilities for random points in time for
a process. A “process” could be almost anything:

• Accidents at an interchange.
• File requests on a server.
• Customers arriving at a store.
• Battery failure and replacement.

• The Poisson process can tell you when one of these random points in time will likely
happen. For example, when customers will arrive at a store, or when a battery might
need to be replaced. It’s basically a counting process; it counts the number of times an
event has occurred since a given point in time, like 1210 customers since 1 p.m., or 543
files since noon. An assumption for the process is that it is only used for independent
events.

Poisson Process Example

Let’s say a store is interested in the number of customers per hour, and arrivals follow a
Poisson process with rate 120, i.e., 120 customers arrive per hour on average. This could also
be worded as “the expected mean inter-arrival time is 0.5 minutes”, because a customer can be
expected every half minute, that is, every 30 seconds.

The negative exponential distribution models this process, so we could write:


Poisson 120 = Exponential 0.5
So the units for the Poisson process are people and the units for the exponential are minutes.

The 120 customers per hour could also be broken down into categories, like men (probability
of 0.5), women (probability of 0.3) and children (probability of 0.2). You could calculate the
time interval between the arrival times of women, men, or children.
For example, the time between arrivals of women would be:
A rate 0.3*120 = 36 per hour, or one every 100 seconds.
For men it would be:
A rate of 0.5*120 = 60 per hour, or one every minute.
And for children:
A rate of 0.2*120 = 24 per hour, or one every 150 seconds.
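The arithmetic above can be checked in a few lines of plain Python:

```python
rate_per_hour = 120
shares = {"women": 0.3, "men": 0.5, "children": 0.2}

for group, share in shares.items():
    group_rate = share * rate_per_hour   # arrivals per hour for this category
    seconds_between = 3600 / group_rate  # mean inter-arrival time in seconds
    print(f"{group}: {group_rate:.0f}/hour, one every {seconds_between:.0f} seconds")
```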

What is the Exponential Distribution used for?

• The exponential distribution is mostly used for testing product reliability. It’s also an
important distribution for building continuous-time Markov chains. The exponential
often models waiting times and can help you to answer questions like:

• “How much time will go by before a major hurricane hits the Atlantic
Seaboard?” or
• “How long will the transmission in my car last before it breaks?”.

• If you assume that the answer to these questions is unknown, you can think of the
elapsed time as a random variable with an exponential distribution as long as the events
occur continuously and independently at a constant rate.
The exponential distribution can model space, as well as time, between events. Young and
Young (1998) give an example of how the exponential distribution can model space
between events (instead of time between events). Say that grass seeds are randomly dispersed
over a field; each point in the field has an equal chance of a seed landing there. Now select a
point P, then measure the distance to the nearest seed. This distance becomes the radius of a
circle with point P at the center. The circle’s area has an exponential distribution.

Gamma Distribution

The gamma distribution is a family of right-skewed, continuous probability distributions.


These distributions are useful in real-life where something has a natural minimum of 0. For
example, it is commonly used in finance, for elapsed times, or during Poisson processes.

If X is a continuous random variable, then the probability density function is:

f(x) = x^(α−1) · e^(−x/β) / (Γ(α) · β^α), for x > 0

Where

• Γ(α) = the gamma function, Γ(α) = ∫ from 0 to ∞ of t^(α−1) e^(−t) dt.

• α = The shape parameter.

• β (sometimes θ is used instead) = The scale parameter (its reciprocal, 1/β, is the rate
parameter).
α and β are both greater than 0.
When α = 1, this becomes the exponential distribution (with mean β).
When β = 1, this becomes the standard gamma distribution.

Alpha and beta define the shape of the graph. Although they both have an effect on the shape,
a change in β will show a sharp change, as shown by the pink and blue lines in this graph:
• You can think of α as the number of events you are waiting for (although α can be
any positive number — not just integers), and β as the mean waiting time until the
first event.

• If α (number of events) remains unchanged but β (mean time between events) increases,
it makes sense that the graph shifts to the right as waiting times will lengthen.

• Similarly, if the mean waiting time (β) stays the same but the number of events (α)
increases, the graph will also shift to the right. As α approaches infinity, the gamma
closely matches the normal distribution.

Mean, Variance, MGF

Mean: E(X) = αβ
Variance: Var(X) = αβ²
Moment generating function: MX(t) = (1 − βt)^(−α), for t < 1/β

How to Find Gamma Distribution Probabilities in Excel

Step 1: Type “=GAMMA.DIST(” into an empty cell.


Step 2: Type the value where you want to find the probability. For example, if you want to
find the probability at x=6, the function becomes “=GAMMA.DIST( 6”
Step 3: Type your α and β values, separated by a comma. For example, if your α is 3 and β is
2, the function becomes: “=GAMMA.DIST( 6, 3, 2”
Step 4: Type FALSE, close the parentheses, and then hit the “enter” key. The entire function
is “=GAMMA.DIST(6, 3, 2, FALSE)”. Excel will return the probability density 0.112020904.
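The same density can be computed in Python (SciPy assumed); note that SciPy's `scale` argument plays the role of Excel's β here:

```python
from scipy.stats import gamma

# Density of the gamma distribution at x = 6 with shape α = 3 and scale β = 2.
density = gamma.pdf(6, a=3, scale=2)
print(round(density, 9))  # 0.112020904, matching Excel
```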
Lognormal Distribution
• A lognormal (log-normal or Galton) distribution is a probability distribution with a
normally distributed logarithm. A random variable is lognormally distributed if its
logarithm is normally distributed.

Skewed distributions with low mean values, large variance, and all-positive values often fit this
type of distribution. Values must be positive as log(x) exists only for positive values of x.

The probability density function is defined by the mean μ and standard deviation σ of the
variable's logarithm:

f(x) = (1 / (xσ√(2π))) · e^(−(ln x − μ)² / (2σ²)), for x > 0
The shape of the lognormal distribution is defined by three parameters:

• σ, the shape parameter. This is also the standard deviation of the variable's
logarithm, and it affects the general shape of the distribution. Usually these
parameters are known from historical data; sometimes you might be able to
estimate them from current data. The shape parameter doesn't change the location
or height of the graph; it just affects the overall shape.
• m, the scale parameter (this is also the median). This parameter shrinks or
stretches the graph.
• Θ (or μ), the location parameter, which tells you where on the x-axis the graph
is located.
The standard lognormal distribution has a location parameter of 0 and a scale parameter of 1
(shown in blue in the image below). If Θ = 0, the distribution is called a 2-parameter lognormal
distribution.
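The defining property (the logarithm of a lognormal variable is normally distributed) can be checked by simulation; SciPy is assumed, and μ = 1, σ = 0.5 below are arbitrary:

```python
import numpy as np
from scipy.stats import lognorm

sigma, mu = 0.5, 1.0
# SciPy parameterizes the lognormal by s = σ and scale = e^μ.
samples = lognorm.rvs(s=sigma, scale=np.exp(mu), size=50_000, random_state=0)

logs = np.log(samples)
print(f"mean(log x) ≈ {logs.mean():.2f} (μ = {mu}), sd(log x) ≈ {logs.std():.2f} (σ = {sigma})")
```

All sampled values are strictly positive, as the text requires.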

When are Lognormal Distributions Used?

• The most commonly used (and most familiar) distribution in science is the normal
distribution. The familiar “bell curve” models many natural phenomena, from the
simple (weights or heights) to the more complex. The lognormal distribution plays a
similar role for positive, skewed quantities. For example, the following phenomena
can all be modeled with a lognormal distribution:
• Milk production by cows.
• Lives of industrial units with failure modes that are characterized by fatigue-
stress.
• Amounts of rainfall.
• Size distributions of rainfall droplets.
• The volume of gas in a petroleum reserve.
Many more phenomena can be modeled with the lognormal distribution, such as the lengths
of latent periods of infectious diseases or species abundance.

Logistic Distribution

• The logistic distribution is used for modeling growth, and also for logistic regression.
It is a symmetrical distribution, unimodal (it has one peak) and is similar in shape to
the normal distribution.
• In fact, the logistic and normal distributions are so close in shape (although the logistic
tends to have slightly fatter tails) that for most applications it’s impossible to tell one
from the other.

Why is it Used?

• The logistic distribution is mainly used because the curve has a relatively
simple cumulative distribution formula to work with. The formula approximates the
normal distribution extremely well.
• Finding cumulative probabilities for the normal distribution usually involves looking
up values in the z-table, rounding up or down to the nearest z-score. Exact values are
usually found with statistical software, because the normal cumulative distribution
function is difficult to work with, involving integration:

Φ(x) = (1/√(2π)) · ∫ from −∞ to x of e^(−t²/2) dt

• Although there are many other functions that can approximate the normal, they also
tend to have very complicated mathematical formulas. The logistic distribution, in
comparison, has a much simpler CDF formula:

F(x) = 1 / (1 + e^(−(x − μ)/s))
Two parameters define the shape of the distribution:

• The location parameter (μ) tells you where it’s centered on the x-axis.
• The scale parameter (s) tells you what the spread is; it is proportional to the
standard deviation (which equals sπ/√3).

The probability density function for the distribution is:

f(x) = e^(−(x − μ)/s) / (s · (1 + e^(−(x − μ)/s))²)
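How close the logistic CDF sits to the normal CDF can be checked with the standard library; s is chosen here so both distributions have standard deviation 1:

```python
import math

def logistic_cdf(x, mu=0.0, s=1.0):
    return 1.0 / (1.0 + math.exp(-(x - mu) / s))

def normal_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))

s = math.sqrt(3) / math.pi  # logistic scale giving standard deviation 1
worst = max(abs(logistic_cdf(0.01 * i, s=s) - normal_cdf(0.01 * i))
            for i in range(-400, 401))
print(f"largest CDF gap on [-4, 4]: {worst:.4f}")  # only about 0.02
```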
Maxwell-Boltzmann Distribution
• The Maxwell-Boltzmann distribution is a probability distribution mainly used in
physics and statistical mechanics.
• The distribution, which describes the speeds of particles in a particular gas sample,
was devised by James C. Maxwell and Ludwig Boltzmann.
• In simple terms, it gives the probability of a particle in the system having a given
speed (and hence kinetic energy). The area under the distribution represents the total
number of particles in a sample.

Poisson Distribution
• A Poisson distribution is a tool that helps to predict the probability of certain events
happening when you know how often the event has occurred.
• It gives us the probability of a given number of events happening in a fixed interval
of time.

The Poisson distribution is defined only for integer counts on the horizontal axis. λ (also
written as μ) is the expected number of event occurrences.

Practical Uses of the Poisson Distribution

• A textbook store rents an average of 200 books every Saturday night. Using this data,
you can predict the probability that more books will sell (perhaps 300 or 400) on the
following Saturday nights.

• Another example is the number of diners in a certain restaurant every day. If


the average number of diners for seven days is 500, you can predict the probability of
a certain day having more customers.
✓ Because of this application, Poisson distributions are used by businesses to forecast
the number of customers or sales on certain days or seasons of the year.

✓ In business, overstocking will sometimes mean losses if the goods are not sold.
Likewise, having too little stock still means a lost business opportunity, because you
cannot maximize sales during a shortage. With this tool, businesses can estimate when
demand will be unusually high and purchase more stock in time.

✓ Hotels and restaurants can prepare for an influx of customers: they can hire extra
temporary workers in advance, purchase more supplies, or make contingency plans in
case they cannot accommodate all the guests coming to the area.
With the Poisson distribution, companies can adjust supply to demand in order to keep
their business earning good profit. In addition, waste of resources is prevented.

Calculating the Poisson Distribution

The Poisson Distribution pmf is: P(x; μ) = (e^(−μ) · μ^x) / x!

Where:

• The symbol “!” is a factorial.


• μ (the expected number of occurrences) is sometimes written as λ. Sometimes
called the event rate or rate parameter.
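The pmf can be computed by hand and cross-checked against SciPy (μ = 4 and x = 2 are arbitrary illustrative values):

```python
import math
from scipy.stats import poisson

mu, x = 4.0, 2
by_hand = math.exp(-mu) * mu**x / math.factorial(x)
print(round(by_hand, 6), round(poisson.pmf(x, mu), 6))  # both 0.146525
```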
