Computational Data Science - Unit 4
UNIT – 4
Statistical tests: Hypothesis testing and its types, p-values, and confidence intervals,
statistical assumptions, normal distribution, parametric test, non-normal distribution
and its types (Beta Distribution, Exponential Distribution, Gamma Distribution, Log-
Normal Distribution, Logistic Distribution, Maxwell-Boltzmann Distribution, Poisson
Distribution), non-parametric tests, choosing the right statistical test
In today’s data-driven world, decisions are based on data all the time. Hypotheses play a
crucial role in that process, whether in business decisions, the health sector, academia, or
quality improvement. Without hypotheses and hypothesis tests, you risk drawing the wrong
conclusions and making bad decisions.
➢ Hypothesis testing is a type of statistical analysis in which you put your
assumptions about a population parameter to the test. It is used to estimate the
relationship between two statistical variables. Some example hypotheses:
• A teacher assumes that 60% of his college's students come from lower-middle-class
families.
• A doctor believes that 3D (Diet, Dose, and Discipline) is 90% effective for diabetic
patients.
• A sanitizer manufacturer claims that its product kills 95 percent of germs on average.
To put this company's claim to the test, you would create a null and an alternative hypothesis.
It is critical to rephrase your original research hypothesis (the prediction that you wish to study)
as a null (H0) and alternative (Ha) hypothesis so that you can test it quantitatively. Your original
hypothesis, which predicts a link between variables, is generally your alternative hypothesis. The
null hypothesis predicts no link between the variables of interest.
For a statistical test to be legitimate, sampling and data collection must be done in a way that
is meant to test your hypothesis. You cannot draw statistical conclusions about the population
you are interested in if your data is not representative.
Most statistical tests compare within-group variance (how spread out the data are within a
category) against between-group variance (how different the categories are from one another).
If the between-group variance is large enough that there is little or no overlap between groups,
your statistical test will return a low p-value to reflect this. This suggests that the differences
between these groups are unlikely to have occurred by chance. Alternatively, if there is large
within-group variance and low between-group variance, your statistical test will return a high
p-value: any difference you find across groups is most likely attributable to chance. The number
of variables and the level of measurement of your collected data will influence your choice of
statistical test.
The results of your statistical test determine whether the null hypothesis should be rejected or
not. In most circumstances, you will base your judgment on the p-value provided by the
statistical test. Usually, your preset level of significance for rejecting the null hypothesis will
be 0.05 - that is, you reject when there is less than a 5% likelihood that these data would be
seen if the null hypothesis were true. In other circumstances, researchers use a lower level of
significance, such as 0.01 (1%). This reduces the possibility of wrongly rejecting the null
hypothesis.
The findings of hypothesis testing will be discussed in the results and discussion sections of
your research paper, dissertation, or thesis. In the results section you should include a concise
overview of the data and a summary of the findings of your statistical test. In the discussion
section you can talk about whether your results confirmed your initial hypothesis. "Rejecting"
or "failing to reject" the null hypothesis is the formal language used in hypothesis testing, and
it is likely required in your statistics assignments.
Types of Hypothesis Testing
Z Test
A z-test is used to compare means when the population standard deviation is known and the
sample size is large (typically more than 30).
T Test
A t-test is a statistical test used to compare the means of two groups. It is frequently used in
hypothesis testing to determine whether two groups differ, or whether a procedure or treatment
affects the population of interest.
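As a sketch of how such a comparison looks in practice, the two-sample t-test below uses SciPy on simulated data; the group sizes, means, and spread are assumptions chosen only for illustration.

```python
# Two-sample t-test sketch: do two groups share the same mean?
# The data below are simulated for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=30)  # e.g., a control group
group_b = rng.normal(loc=53, scale=5, size=30)  # e.g., a treatment group

# Null hypothesis: the two population means are equal.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

A low p-value here would suggest the two group means genuinely differ.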
Chi-Square
A chi-square test is used in hypothesis testing to check whether your data match what was
predicted. The chi-square test analyzes the differences between categorical variables from a
random sample to determine whether the expected and observed results fit well. The test's
fundamental premise is that the observed values in your data are compared against the
expected values that would be present if the null hypothesis were true.
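For instance, the goodness-of-fit version of the test can be sketched with SciPy; the observed and expected counts below are invented for illustration.

```python
# Chi-square goodness-of-fit sketch: compare observed category counts
# against the counts expected under the null hypothesis (invented data).
from scipy import stats

observed = [48, 35, 17]   # observed counts per category
expected = [50, 30, 20]   # counts predicted by the null hypothesis

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {chi2:.3f}, p = {p_value:.4f}")
```

Note that `chisquare` expects the observed and expected counts to have the same total.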
Hypothesis Testing and Confidence Intervals
• Both confidence intervals and hypothesis tests are inferential techniques that depend on
approximating the sampling distribution.
• In other words, a hypothesis test at the 0.05 level will virtually always fail to reject the
null hypothesis if the 95% confidence interval contains the hypothesized value.
Conversely, a hypothesis test at the 0.05 level will nearly certainly reject the null
hypothesis if the 95% confidence interval does not include the hypothesized parameter.
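A sketch of this duality, using a one-sample t-test and the matching 95% confidence interval; the sample values and the hypothesized mean are simulated assumptions for illustration.

```python
# Duality sketch: a 0.05-level t-test rejects exactly when the 95%
# confidence interval excludes the hypothesized value. Simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(loc=10.0, scale=2.0, size=25)
hypothesized_mean = 9.0

t_stat, p_value = stats.ttest_1samp(sample, popmean=hypothesized_mean)

mean = sample.mean()
sem = stats.sem(sample)
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)

rejects = p_value < 0.05
ci_excludes = not (ci_low <= hypothesized_mean <= ci_high)
print(f"p = {p_value:.4f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
print(f"test rejects: {rejects}, CI excludes value: {ci_excludes}")
```

Whatever data you feed in, the two conclusions agree, because the t-test and the t-based interval are built from the same sampling distribution.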
P-Value
➢ A p-value is a metric that expresses how likely it is that an observed difference could
have occurred by chance alone, assuming the null hypothesis is true. As the p-value
decreases, the statistical significance of the observed difference increases. If the p-value
falls below your chosen significance level, you reject the null hypothesis.
➢ As an example, suppose you are testing whether a new advertising campaign has
increased the product's sales. The null hypothesis states that there is no change in sales
due to the campaign, and the p-value is the probability of seeing a difference at least
as large as the observed one if that null hypothesis were true. If the p-value is 0.30,
data showing this much change would appear about 30% of the time even with no real
effect on sales; if the p-value is 0.03, such data would appear only 3% of the time under
the null hypothesis. As you can see, the lower the p-value, the stronger the evidence
against the null hypothesis, and hence the stronger the evidence that the advertising
campaign changed sales.
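A minimal sketch of the advertising example, assuming invented weekly sales figures and a known historical mean, could use a one-sample t-test:

```python
# One-sample t-test sketch for the advertising-campaign example.
# The historical mean and the post-campaign sales are invented numbers.
import numpy as np
from scipy import stats

historical_mean = 100.0  # average weekly sales before the campaign
sales_after = np.array([104, 98, 110, 107, 101, 112, 99, 106])

# Null hypothesis: mean sales are unchanged (mu = 100).
t_stat, p_value = stats.ttest_1samp(sales_after, popmean=historical_mean)

alpha = 0.05
print(f"p = {p_value:.4f}")
if p_value < alpha:
    print("Reject the null hypothesis: sales appear to have changed.")
else:
    print("Fail to reject the null hypothesis.")
```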
Normal Distribution
• In probability theory and statistics, the Normal Distribution, also called the Gaussian
Distribution, is the most significant continuous probability distribution. Sometimes it
is also called a bell curve.
• A large number of random variables are either nearly or exactly represented by the
normal distribution, in every field from the physical sciences to economics. Furthermore,
it can be used to approximate other probability distributions, which supports the use of
the word ‘normal’, as in the distribution most commonly used.
The probability density function is:

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))

Where,
• x is the variable
• μ is the mean
• σ is the standard deviation
• Approximately 68% of the data falls within one standard deviation of the mean. (i.e.,
Between Mean- one Standard Deviation and Mean + one standard deviation)
• Approximately 95% of the data falls within two standard deviations of the mean. (i.e.,
Between Mean- two Standard Deviation and Mean + two standard deviations)
• Approximately 99.7% of the data fall within three standard deviations of the mean. (i.e.,
Between Mean- three Standard Deviation and Mean + three standard deviations)
• In a normal distribution, the mean, median and mode are equal (i.e., Mean = Median=
Mode).
• The total area under the curve should be equal to 1.
• The normally distributed curve should be symmetric at the centre.
• Exactly half of the values lie to the right of the centre and exactly half lie to the left
of the centre.
• The normal distribution should be defined by the mean and standard deviation.
• The normal distribution curve must have only one peak. (i.e., Unimodal)
• The curve approaches the x-axis but never touches it as it extends farther away from
the mean.
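The 68-95-99.7 percentages listed above can be checked empirically on simulated data; this is a quick sketch, and the sample size and seed are arbitrary choices.

```python
# Empirical check of the 68-95-99.7 rule on simulated normal data.
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 0.0, 1.0
x = rng.normal(mu, sigma, size=100_000)

for k in (1, 2, 3):
    # Fraction of samples within k standard deviations of the mean.
    within = np.mean(np.abs(x - mu) <= k * sigma)
    print(f"within {k} standard deviation(s): {within:.3f}")
```

With a large sample, the printed fractions land very close to 0.683, 0.954, and 0.997.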
Applications
The normal distributions are closely associated with many things such as:
• Statistical tests are statistical methods that help us reject or not reject our null
hypothesis. They’re based on probability distributions and can be one-tailed or two-
tailed, depending on the hypotheses that we’ve chosen.
There are other ways in which statistical tests can differ and one of them is based on their
assumptions of the probability distribution that the data in question follows.
• Parametric tests are those statistical tests that assume the data approximately
follows a normal distribution, amongst other assumptions (examples include z-
test, t-test, ANOVA). Important note — the assumption is that the data of the whole
population follows a normal distribution, not the sample data that you’re working
with.
• Nonparametric tests are those statistical tests that don’t assume anything about
the distribution followed by the data, and hence are also known as distribution
free tests (examples include Chi-square, Mann-Whitney U). Nonparametric tests
are based on the ranks held by different data points.
• Common parametric tests are focused on analyzing and comparing the mean or
variance of data.
• The mean is the most commonly used measure of central tendency to describe data,
however it is also heavily impacted by outliers. Thus, it is important to analyze your
data and determine whether the mean is the best way to represent it. If yes, then
parametric tests are the way to go! If not, and the median better represents your
data, then nonparametric tests might be the better option.
As mentioned above, parametric tests have a couple of assumptions that need to be met by the
data:
1. Normality — the sample data come from a population that approximately follows
a normal distribution
2. Homogeneity of variance — the samples come from populations with the
same variance
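As a sketch, both assumptions can be checked with SciPy before running a parametric test; the two samples below are simulated for illustration.

```python
# Checking the two parametric-test assumptions:
# Shapiro-Wilk for normality, Levene for equal variances. Simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample_a = rng.normal(10, 2, size=40)
sample_b = rng.normal(11, 2, size=40)

_, p_normal_a = stats.shapiro(sample_a)            # H0: population is normal
_, p_normal_b = stats.shapiro(sample_b)
_, p_equal_var = stats.levene(sample_a, sample_b)  # H0: equal population variances

print(f"normality p-values: {p_normal_a:.3f}, {p_normal_b:.3f}")
print(f"equal-variance p-value: {p_equal_var:.3f}")
```

High p-values here mean the data show no evidence against the assumptions, so a parametric test is reasonable; low p-values point toward a nonparametric alternative.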
• Nonparametric tests are helpful when the data are measured on different kinds of
measurement scales.
• Nonparametric tests are also used when the distribution of the population is unknown.
Non-Normal Distribution
• Although the normal distribution takes center stage in statistics, many processes follow
a non-normal distribution. This can be because the data naturally follow a specific
type of non-normal distribution (for example, bacterial growth naturally follows
an exponential distribution).
What is a Beta Distribution?
One reason that this distribution is confusing is that there are three “Betas” to contend with,
and they all have different meanings: the beta distribution itself, the beta function B(α, β) that
appears in its density, and the shape parameter β.
The basic beta distribution is also called the beta distribution of the first kind. Beta distribution
of the second kind is another name for the beta prime distribution.
Exponential Distribution
• The Poisson process gives you a way to find probabilities for random points in time for
a process. A “process” could be almost anything:
• Accidents at an interchange.
• File requests on a server.
• Customers arriving at a store.
• Battery failure and replacement.
• The Poisson process can tell you when one of these random points in time will likely
happen. For example, when customers will arrive at a store, or when a battery might
need to be replaced. It’s basically a counting process; it counts the number of times an
event has occurred since a given point in time, like 1210 customers since 1 p.m., or 543
files since noon. An assumption for the process is that it is only used for independent
events.
Let’s say a store is interested in the number of customers per hour, and arrivals follow a
Poisson process with a rate of 120 per hour, meaning that on average 120 customers arrive
per hour. This could also be worded as “the expected mean inter-arrival time is 0.5 minutes”,
because a customer can be expected every half minute, i.e., every 30 seconds.
The 120 customers per hour could also be broken down into categories, like men (probability
of 0.5), women (probability of 0.3) and children (probability of 0.2). You could calculate the
time interval between the arrivals times of women, men, or children.
For example, the time between arrivals of women would be:
A rate 0.3*120 = 36 per hour, or one every 100 seconds.
For men it would be:
A rate of 0.5*120 = 60 per hour, or one every minute.
And for children:
A rate of 0.2*120 = 24 per hour, or one every 150 seconds.
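The rate arithmetic above can be written out directly; this is a sketch of splitting (“thinning”) a Poisson rate by the category probabilities from the example.

```python
# Splitting ("thinning") a Poisson arrival rate by category probabilities,
# reproducing the arithmetic from the store example above.
total_rate = 120  # customers per hour

rates = {}
for category, p in {"men": 0.5, "women": 0.3, "children": 0.2}.items():
    per_hour = p * total_rate           # arrivals per hour in this category
    seconds_between = 3600 / per_hour   # mean inter-arrival time in seconds
    rates[category] = (per_hour, seconds_between)
    print(f"{category}: {per_hour:.0f} per hour, one every {seconds_between:.0f} s")
```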
• The exponential distribution is mostly used for testing product reliability. It’s also an
important distribution for building continuous-time Markov chains. The exponential
often models waiting times and can help you to answer questions like:
• “How much time will go by before a major hurricane hits the Atlantic
Seaboard?” or
• “How long will the transmission in my car last before it breaks?”.
• If you assume that the answer to these questions is unknown, you can think of the
elapsed time as a random variable with an exponential distribution as long as the events
occur continuously and independently at a constant rate.
The exponential distribution can model area, as well as time, between events. Young and
Young (198) give an example of how the exponential distribution can also model space
between events (instead of time between events). Say that grass seeds are randomly dispersed
over a field; each point in the field has an equal chance of a seed landing there. Now select a
point P, then measure the distance to the nearest seed. This distance becomes the radius for a
circle with point P at the center. The circle’s area has an exponential distribution.
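A small sketch with SciPy, reusing the store's 0.5-minute mean inter-arrival time from the earlier example; the 1-minute question is an assumption added for illustration.

```python
# Exponential waiting-time sketch: with a mean inter-arrival time of
# 0.5 minutes, how likely is the next arrival within 1 minute?
from scipy import stats

mean_wait = 0.5                 # minutes between arrivals (the scale parameter)
waiting = stats.expon(scale=mean_wait)

p_within_1 = waiting.cdf(1.0)   # P(wait <= 1 minute) = 1 - exp(-1/0.5)
print(f"P(wait <= 1 min) = {p_within_1:.4f}")
```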
Gamma Distribution
The probability density function, with shape parameter α and scale parameter β, is:

f(x) = x^(α−1) e^(−x/β) / (β^α Γ(α)), for x > 0

Alpha and beta define the shape of the graph. Although they both have an effect on the shape,
a change in β produces the sharper change.
• You can think of α as the number of events you are waiting for (although α can be
any positive number — not just integers), and β as the mean waiting time until the
first event.
• If α (number of events) remains unchanged but β (mean time between events) increases,
it makes sense that the graph shifts to the right as waiting times will lengthen.
• Similarly, if the mean waiting time (β) stays the same but the number of events (α)
increases, the graph will also shift to the right. As α approaches infinity, the gamma
closely matches the normal distribution.
Mean: E(X) = αβ
Variance: Var(X) = αβ²
Moment generating function: M_X(t) = (1 − βt)^(−α), for t < 1/β
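These moment formulas can be checked against SciPy's gamma distribution, where the parameter `a` plays the role of α and `scale` the role of β; the parameter values below are arbitrary.

```python
# Check E(X) = alpha*beta and Var(X) = alpha*beta^2 using SciPy's gamma.
from scipy import stats

alpha, beta = 3.0, 2.0
gamma_dist = stats.gamma(a=alpha, scale=beta)

print(gamma_dist.mean())  # alpha * beta
print(gamma_dist.var())   # alpha * beta**2
```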
Log-Normal Distribution
Skewed distributions with low mean values, large variance, and all-positive values often fit this
type of distribution. Values must be positive, since log(x) exists only for positive values of x.
The probability density function is defined by the shape parameter σ, the scale parameter m,
and the location parameter Θ:

f(x) = 1 / ((x − Θ) σ√(2π)) · e^(−(ln((x − Θ)/m))² / (2σ²)), for x > Θ
• σ, the shape parameter. Also, the standard deviation for the lognormal, this
affects the general shape of the distribution. Usually, these parameters are
known from historical data. Sometimes, you might be able to estimate it with
current data. The shape parameter doesn’t change the location or height of the
graph; it just affects the overall shape.
• m, the scale parameter (this is also the median). This parameter shrinks or
stretches the graph.
• Θ (or μ), the location parameter, which tells you where on the x-axis the graph
is located.
The standard lognormal distribution has a location parameter of 0 and a scale parameter of 1
(shown in blue in the image below). If Θ = 0, the distribution is called a 2-parameter lognormal
distribution.
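In SciPy's terms these three parameters map to `s` (shape σ), `scale` (median m), and `loc` (location Θ); a sketch of the standard lognormal described above:

```python
# Standard lognormal sketch: shape sigma = 1, scale (median) m = 1, loc = 0.
from scipy import stats

sigma, m, theta = 1.0, 1.0, 0.0
lognormal = stats.lognorm(s=sigma, scale=m, loc=theta)

# For theta = 0, the median of the distribution equals the scale parameter m.
print(lognormal.median())
```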
• The most familiar distribution in science is the normal distribution, whose “bell curve”
models many natural phenomena, from the simple (weights or heights) to the more
complex. Many other phenomena, however, are better described by the lognormal
distribution. For example, the following can all be modeled with a lognormal
distribution:
• Milk production by cows.
• Lives of industrial units with failure modes that are characterized by fatigue-
stress.
• Amounts of rainfall.
• Size distributions of rainfall droplets.
• The volume of gas in a petroleum reserve.
Many more phenomena can be modeled with the lognormal distribution, such as the length
of latent periods of infectious diseases or species abundance.
Logistic Distribution
• The logistic distribution is used for modeling growth, and also for logistic regression.
It is a symmetrical distribution, unimodal (it has one peak) and is similar in shape to
the normal distribution.
• In fact, the logistic and normal distributions are so close in shape (although the logistic
tends to have slightly fatter tails) that for most applications it’s impossible to tell one
from the other.
Why is it Used?
• The logistic distribution is mainly used because the curve has a relatively
simple cumulative distribution formula to work with. The formula approximates the
normal distribution extremely well.
• Finding cumulative probabilities for the normal distribution usually involves looking
up values in the z-table, rounding up or down to the nearest z-score. Exact values are
usually found with statistical software, because the normal cumulative distribution
function is so difficult to work with, involving an integral with no closed form:

Φ(x) = (1/√(2π)) ∫ from −∞ to x of e^(−t²/2) dt
• Although there are many other functions that can approximate the normal, they also
tend to have very complicated mathematical formulas. The logistic distribution, in
comparison, has a much simpler CDF formula:

F(x) = 1 / (1 + e^(−(x − μ)/s))
• The location parameter (μ) tells you where the distribution is centered on the x-axis.
• The scale parameter (s) tells you what the spread is; it is proportional to the
standard deviation.
The probability density function for the distribution is:

f(x) = e^(−(x − μ)/s) / (s (1 + e^(−(x − μ)/s))²)
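The logistic CDF described above can be sketched directly and compared with the normal CDF (computed here via the error function); choosing s = √3/π is an assumption that matches the logistic's variance to the standard normal's.

```python
# Logistic CDF, F(x) = 1 / (1 + exp(-(x - mu)/s)), compared with the
# standard normal CDF computed from the error function.
import math

def logistic_cdf(x, mu=0.0, s=1.0):
    return 1.0 / (1.0 + math.exp(-(x - mu) / s))

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

s = math.sqrt(3.0) / math.pi   # makes the logistic variance equal to 1
for x in (-1.0, 0.0, 1.0):
    print(f"x = {x:+.1f}: logistic = {logistic_cdf(x, s=s):.4f}, "
          f"normal = {normal_cdf(x):.4f}")
```

The two columns stay within a few hundredths of each other, which is why the logistic curve is such a convenient stand-in for the normal one.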
Maxwell-Boltzmann Distribution
• The Maxwell-Boltzmann distribution is a probability distribution mainly used in
physics and statistical mechanics.
• The distribution, which describes the speeds of the particles in a gas sample, was
devised by James C. Maxwell and Ludwig Boltzmann.
• In simple terms, the distribution tells you how particle speeds, and therefore kinetic
energies, are spread across the particles in a system. The area under the distribution
curve represents the total number of particles in the sample.
Poisson Distribution
• A Poisson distribution is a tool that helps to predict the probability of certain events
happening when you know how often the event has occurred.
• It gives us the probability of a given number of events happening in a fixed interval
of time.
The Poisson distribution is valid only for integer values on the horizontal axis; λ (also written
as μ) is the expected number of event occurrences.
• A textbook store rents an average of 200 books every Saturday night. Using this figure,
you can predict the probability that more books (perhaps 300 or 400) will be rented on
the following Saturday nights.
✓ In business, overstocking will sometimes mean losses if the goods are not sold.
Likewise, having too few stocks would still mean a lost business opportunity because
you were not able to maximize your sales due to a shortage of stock. By using this tool,
businessmen are able to estimate the time when demand is unusually higher, so they
can purchase more stock.
✓ Hotels and restaurants could prepare for an influx of customers: they could hire extra
temporary workers in advance, purchase more supplies, or make contingency plans in
case they cannot accommodate the guests coming to the area.
With the Poisson distribution, companies can adjust supply to demand in order to keep
their business earning good profit. In addition, waste of resources is prevented.
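A sketch of the bookstore example with SciPy; the 220-rental threshold is an assumed question added for illustration, not a figure from the text.

```python
# Poisson sketch: average of 200 rentals per Saturday night. What is the
# chance of an unusually busy night with at least 220 rentals?
from scipy import stats

lam = 200                          # expected rentals per Saturday
rentals = stats.poisson(mu=lam)

p_at_least_220 = rentals.sf(219)   # sf(k) = P(X > k), so this is P(X >= 220)
print(f"P(X >= 220) = {p_at_least_220:.4f}")
```

This kind of tail probability is exactly what a store would use to decide how much extra stock or staffing an unusually busy night justifies.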
The probability mass function is:

P(X = k) = λ^k e^(−λ) / k!

Where λ is the expected number of occurrences and k = 0, 1, 2, … is the observed number of
events.