Mod 2 Business Analytics
ISSN 2229-5518 67
Glocal University,
Abstract: The enormous growth of data from diversified sources has changed the complete scenario of the database world. Most
surveys say that data is very important for all organizations and that its proper handling will demand attention in the future. The various forms of
data available in the digital world need different data models for their storage, processing and analysis. This paper discusses
various kinds of data, with their characteristics and examples, and also shows that the growing data is responsible for the
numerous emerging data models and database evolution.
1. Introduction:
Big Data is a term that catches everyone's attention today. This attention can be justified through some surveys and facts. These surveys and facts say that every second we, the users, are creating new data, which adds to the rate of data growth. Web applications like Facebook, Twitter, Instagram and YouTube connect with 1 billion people every day, and these people not only view but also share and create new data every single second [1]. Surveys say that the size of the digital universe will double every two years [2]. Most organizations are working on data-driven projects [3]. Most organizations do not consider web data to be dead data, while research centers use this data for analysis and try to apply it to business intelligence and pattern prediction. Data mining and data extraction deal with various algorithms to extract data so that it can help improve the IT industries.
[Figure: Data divided into structured, semi-structured and unstructured data]
[Figure: Data growth across unstructured, semi-structured and structured data]
IJSER © 2017
http://www.ijser.org
International Journal of Scientific & Engineering Research Volume 8, Issue 12, December-2017
2.1. Structured data:
Structured data consists mainly of text and is easily processed. Such data is easily entered, stored and analyzed. Structured data is stored in the form of rows and columns, which are easily managed with a language called “structured query language” (SQL) [4]. The relational model [5] is a data model that supports structured data, manages it in the form of rows and tables, and processes the content of a table easily. XML also supports structured data. Much of the content of web pages is in XML form. This content counts as structured data; companies like Google use the structured data they find on the web to understand the content of a page [6]. In this way, most Google searches are done with the help of structured data. Since the start of the database revolution [7], the network [8], hierarchical [9], relational and object-relational [10] data models have dealt with structured data.
Data that consists of tags and is self-describing is generally semi-structured data. It differs from both structured and unstructured data. The Data Object Model [11], Object Exchange Model [11] and Data Guide [11] are well-known data models that express semi-structured data. Concepts of the semi-structured data model include: document instance, document schema, element attributes and element relationship sets [11].
[Figure: Semi-structured data: XML, DOE, E-mails, OEM]
3. Unstructured Data
1. It is not based on a schema.
2. It is not suitable for a relational database.
3. About 90% of the data growing today is unstructured.
4. It includes digital media files, Word documents and PDF files.
5. It is stored in NoSQL databases.
[Figure: Unstructured data (audio, videos, images), with labels NoSQL and Companies]
Fig.4. Attributes of Unstructured Data
[...] kinds of data in upcoming and present scenario.
References:
1. https://www.forbes.com/sites/bernardmarr/2015/09/30/big-data-20-mind-boggling-facts-everyone-must-read/#7e621bc417b1
2. https://insidebigdata.com/2017/02/16/the-exponential-growth-of-data/
3. https://www.idgenterprise.com/resource/research/2015-big-data-and-analytics-survey/
4. J. R. Groff, P. N. Weinberg, SQL: The Complete Reference, second edition, 2002, McGraw-Hill.
5. E. F. Codd, A Relational Model of Data for Large Shared Data Banks, 1970.
6. https://developers.google.com/search/docs/guides/intro-structured-data
7. S. Praveen, Dr. U. Chandra, Arif Ali Wani, A Literature Review on Evolving Database, IJCA, March 2017.
8. https://en.wikipedia.org/wiki/Network_model
9. http://www.edugrabs.com/hierarchical-model/
10. http://www.learn.geekinterview.com/it/data-modeling/object-relational-model.html
11. T. W. Ling, G. Dobbie, Semi-structured Database Design, 2005, Springer, 178, 978-0-387-23567-7.
12. S. Praveen, Dr. U. Chandra, NoSQL: IT Giant Perspectives, 2017, IJCIR.
UNIT 4 EXTRACT, TRANSFORM AND LOADING
4.0 Introduction
4.1 Objectives
4.2 ETL and its Need
4.2.1 Why do You Need ETL?
4.3 ETL Process
4.3.1 Data Extraction
4.3.2 Data Transformation
4.3.3 Data Loading
4.3.3.1 Types of Incremental Loads
4.3.3.2 Challenges in Incremental Loading
4.4 Working of ETL
4.4.1 Layered Implementation of ETL in a Data Warehouse
4.5 ETL and OLAP Data Warehouses
4.6 ETL Tools and their Benefits
4.7 Improving the Performance of ETL
4.8 ELT and its Need
4.8.1 Why do you Need ELT?
4.8.2 Benefits of ELT
4.8.3 ETL Vs ELT
4.9 Summary
4.10 Solutions / Answers
4.11 Further Readings
4.0 INTRODUCTION
A data warehouse is a digital storage system that connects and harmonizes large
amounts of data from many different sources. Data warehouses store current and
historical data in one place and act as the single source of truth for an organization. A
typical data warehouse has four main components, namely:
Central database: A database serves as the foundation of your data warehouse.
Traditionally, these have been standard relational databases running on premise or
in the cloud. But because of Big Data, the need for true, real-time performance, and
a drastic reduction in the cost of RAM, in-memory databases are rapidly gaining in
popularity.
Data integration: Data is pulled from source systems and modified to align the
information for rapid analytical consumption using a variety of data integration
approaches such as ETL (extract, transform, load) and ELT as well as real-time
data replication, bulk-load processing, data transformation, and data quality and
enrichment services.
Metadata: Metadata is data about your data. It specifies the source, usage, values,
and other features of the data sets in your data warehouse. There is business
metadata, which adds context to your data, and technical metadata, which describes
how to access data – including where it resides and how it is structured.
Data warehouse access tools: Access tools allow users to interact with the data in
your data warehouse. Examples of access tools include: query and reporting tools,
application development tools, data mining tools, and OLAP tools.
All these components are engineered for speed so that you can get results quickly
and analyze the data in very little time.
In this unit, we will study one approach used by the data integration component,
Extract, Transform and Load (ETL), in detail.
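The three ETL stages named above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the source records, the `sales` table and all function names are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Hypothetical raw records standing in for data pulled from source systems.
SOURCE_ROWS = [
    {"customer": " alice ", "amount": "120.50"},
    {"customer": "BOB", "amount": "80"},
    {"customer": "", "amount": "15"},  # incomplete record, dropped in transform
]


def extract():
    """Extract: read the raw records from the (hypothetical) source."""
    return list(SOURCE_ROWS)


def transform(rows):
    """Transform: drop incomplete rows, normalize names, cast amounts to float."""
    cleaned = []
    for row in rows:
        name = row["customer"].strip().title()
        if not name:
            continue
        cleaned.append((name, float(row["amount"])))
    return cleaned


def load(rows, conn):
    """Load: write the transformed rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


def run_etl():
    """Run extract, transform and load in order and return the loaded rows."""
    conn = sqlite3.connect(":memory:")  # stand-in for the warehouse database
    load(transform(extract()), conn)
    return conn.execute(
        "SELECT customer, amount FROM sales ORDER BY customer"
    ).fetchall()


result = run_etl()
print(result)
```

Keeping each stage in its own function mirrors the order of the ETL process: data is cleaned before it reaches the target table.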
4.1 OBJECTIVES
After going through this unit, you shall be able to:
- understand the purpose of ETL;
- describe the ETL process, benefits and ETL tools;
- know the complete working of the ETL;
- discuss various layers involved in the ETL implementation;
- summarize the functionality of ELT, its need and benefits, and
- compare and contrast the ETL with ELT.
(b) Reduce - summary operations - data from the previous stage is combined.
Hadoop is optimized for distributed processing analytics. Sort and aggregate
functions execute in parallel on an entire cluster.
Measures of Central Tendency & Dispersion
Measures that indicate the approximate center of a distribution are called measures of central tendency.
Measures that describe the spread of the data are measures of dispersion. These measures include the mean,
median, mode, range, upper and lower quartiles, variance, and standard deviation.
Example:
Consider the data set: 17, 10, 9, 14, 13, 17, 12, 20, 14
$\bar{x} = \frac{\sum x_i}{n} = \frac{126}{9} = 14$
Example:
Consider the data set: 17, 10, 9, 14, 13, 17, 12, 20, 14
Step 1: Put the data in order from smallest to largest. 9, 10, 12, 13, 14, 14, 17, 17, 20
Step 2: Determine the absolute middle of the data. 9, 10, 12, 13, [14], 14, 17, 17, 20
Note: Since the number of data points is odd, choose the one in the very middle. The median is 14.
1. Put the data in order from smallest to largest, as you did to find your median.
2. Look for any value that occurs more than once.
3. Determine which of the values from Step 2 occurs most frequently.
Example:
Consider the data set: 17, 10, 9, 14, 13, 17, 12, 20, 14
Step 1: Put the data in order from smallest to largest. 9, 10, 12, 13, 14, 14, 17, 17, 20
Step 2: Look for any number that occurs more than once. 9, 10, 12, 13, 14, 14, 17, 17, 20
Step 3: Determine which of those occur most frequently. 14 and 17 both occur twice.
Example:
Consider the data set: 17, 10, 9, 14, 13, 17, 12, 20, 14
Step 1: Put the data in order from smallest to largest. 9, 10, 12, 13, 14, 14, 17, 17, 20
Step 2: Identify the lower half of your data. 9, 10, 12, 13
Step 3: Identify the upper half of your data. 14, 17, 17, 20
Step 4: For the lower half, find the median. 9, 10, 12, 13
Since there is an even number of data points in this half, you will find the median by summing the
two in the center and dividing by two: (10 + 12) / 2 = 11. This is Q1.
Step 5: For the upper half, find the median. 14, 17, 17, 20
Since there is an even number of data points in this half, you will find the median by summing the
two in the center and dividing by two: (17 + 17) / 2 = 17. This is Q3.
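The mean, median, mode and quartile steps worked through for this data set can be checked with a short script. This is a sketch, not a definitive implementation: the helper names are illustrative, and the quartile function assumes the same convention as the steps above (the middle value is left out of both halves when the number of data points is odd) and at least two data points.

```python
from collections import Counter
from statistics import mean, median

data = [17, 10, 9, 14, 13, 17, 12, 20, 14]


def modes(values):
    """Return every value that occurs most often (and more than once)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top and top > 1)


def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves.

    Assumes at least two values; for an odd count the middle value
    belongs to neither half.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]
    return median(lower), median(upper)


q1, q3 = quartiles(data)
print("mean:", mean(data))
print("median:", median(data))
print("modes:", modes(data))
print("Q1, Q3:", q1, q3)
```

Other quartile conventions exist (for example, interpolation between neighbouring values), so library functions may return slightly different Q1 and Q3 for the same data.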
1. Identify the largest value in your data set. This is called the maximum.
2. Identify the lowest value in your data set. This is called the minimum.
3. Subtract the minimum from the maximum.
Example:
Consider the data set: 17, 10, 9, 14, 13, 17, 12, 20, 14
Step 1: Put the data in order from smallest to largest. 9, 10, 12, 13, 14, 14, 17, 17, 20
Step 2: Identify your maximum. The maximum is 20.
Step 3: Identify your minimum. The minimum is 9.
Step 4: Subtract the minimum from the maximum. 20 – 9 = 11
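1. Find the mean of the data ($\mu$ if calculating for a population or $\bar{x}$ if using a sample).
2. Subtract the mean ($\mu$ or $\bar{x}$) from each data value ($x_i$).
3. Square each calculation from Step 2.
4. Add the values of the squares from Step 3.
5. Find the number of data points in your set, called n.
6. Divide the sum from Step 4 by the number n (if calculating for a population) or n – 1 (if using a
sample). This will give you the variance.
7. To find the standard deviation, square root this number.
Formulas:
Sample Variance, $s^2$: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Population Variance, $\sigma^2$: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{n}$
Sample Standard Deviation, $s$: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
Population Standard Deviation, $\sigma$: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}$
Example (the data set 17, 10, 9, 14, 13, 17, 12, 20, 14, treated as a sample with mean 14):
Step 2: Subtract the mean from each data value. 17 – 14 = 3; 10 – 14 = -4; 9 – 14 = -5; 14 – 14 = 0;
13 – 14 = -1; 17 – 14 = 3; 12 – 14 = -2; 20 – 14 = 6; 14 – 14 = 0
Step 3: Square these values. $3^2 = 9$; $(-4)^2 = 16$; $(-5)^2 = 25$; $0^2 = 0$; $(-1)^2 = 1$; $3^2 = 9$; $(-2)^2 = 4$; $6^2 = 36$; $0^2 = 0$
Step 4: Add the squares. 9 + 16 + 25 + 0 + 1 + 9 + 4 + 36 + 0 = 100
Step 5: Divide by n – 1 = 8 to find the variance. 100 / 8 = 12.5
Step 6: Square root this number to find your standard deviation. $\sqrt{12.5} = 3.536$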
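The variance procedure can be written out step by step in code. The sketch below follows the numbered steps directly rather than calling a library routine; the function name and the `sample` flag are illustrative choices, not part of the original text.

```python
from math import sqrt

data = [17, 10, 9, 14, 13, 17, 12, 20, 14]


def variance(values, sample=True):
    """Variance by the steps above: deviations, squares, sum, divide."""
    n = len(values)
    mean_value = sum(values) / n                           # Step 1
    squared = [(x - mean_value) ** 2 for x in values]      # Steps 2 and 3
    total = sum(squared)                                   # Step 4
    divisor = n - 1 if sample else n                       # Steps 5 and 6
    return total / divisor


sample_variance = variance(data, sample=True)
sample_sd = sqrt(sample_variance)                          # Step 7
population_variance = variance(data, sample=False)

print(sample_variance, round(sample_sd, 3), round(population_variance, 3))
```

Python's standard library offers `statistics.variance` and `statistics.pvariance` for the sample and population cases, which can serve as a cross-check.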
[Figure: annotated plot marking the lower quartile and the median, with a note on the values that could be subtracted to find the range]
STRUCTURE
14.0 Objectives
14.1 Introduction
14.2 Types of Probability Distribution
14.3 Concept of Random Variables
14.4 Discrete Probability Distribution
14.4.1 Binomial Distribution
14.4.2 Poisson Distribution
14.5 Continuous Probability Distribution
14.5.1 Normal Distribution
14.5.2 Characteristics of Normal Distribution
14.5.3 Importance and Application of Normal Distribution
14.6 Let Us Sum Up
14.7 Key Words
14.8 Answers to Self Assessment Exercises
14.9 Terminal Questions/Exercises
14.10 Further Reading
14.0 OBJECTIVES
After studying this unit, you should be able to:
14.1 INTRODUCTION
A probability distribution is essentially an extension of the theory of probability
which we have already discussed in the previous unit. This unit introduces the
concept of a probability distribution, and shows how the various basic
probability distributions (binomial, Poisson, and normal) are constructed. All these
probability distributions have immensely useful applications and explain a wide
variety of business situations which call for computation of desired probabilities.
This means that the unity probability of a certain event is distributed over a set
of disjoint events making up a complete group. In general, a tabular recording
of the probabilities of all the possible outcomes that could result if a random
(chance) experiment is done is called a “Probability Distribution”. It is also
termed a theoretical frequency distribution.
In the frequency distribution, the class frequencies add up to the total number
of observations (N), where as in the case of probability distribution the possible
outcomes (probabilities) add up to ‘one’. Like the former, a probability
distribution is also described by a curve and has its own mean, dispersion, and
skewness.
Table 14.2: Probability Distribution of the Possible No. of Heads from Two-toss
Experiment of a Fair Coin
No. of Heads (H)    Tosses              Probability of outcomes P(H)
0                   (T, T)              1/4 = 0.25
1                   (H, T) + (T, H)     1/2 = 0.50
2                   (H, H)              1/4 = 0.25
We must note that the above tables are not the real outcome of tossing a fair
coin twice. Rather, they are a theoretical outcome, i.e., they represent the way in which
we expect our two-toss experiment of an unbiased coin to behave over time.
In the example given in the Introduction, we have seen that the outcomes of the
experiment of two tosses of a fair coin were expressed in terms of the number
of heads. We found in the example that H (head) can assume values of 0, 1
and 2, and corresponding to each value a probability is associated. This
uncertain real variable H, which assumes different numerical values depending
on the outcomes of an experiment, and to each of whose values a probability
assignment can be made, is known as a random variable. The resulting
representation of all the values with their probabilities is termed the
probability distribution of H.
H:     0      1      2
P(H):  0.25   0.50   0.25
In the above situations, we have seen that the random variable takes a limited
number of values. There are certain situations where the variable under
consideration may have infinite values. Consider for example, that we are
interested in ascertaining the probability distribution of the weight of one kg.
coffee packs. We have reasons to believe that the packing process is such that
a certain percentage of the packs weigh slightly below one kg., and some packs are
above one kg. It is easy to see that it is essentially by chance that the pack
will weigh exactly 1 kg., and there are an infinite number of values that the
random variable ‘weight’ can take. In such cases, it makes sense to talk of the
probability that the weight will be between two values, rather than the
probability of the weight taking any specific value. These types of random
variables which can take an infinitely large number of values are called
continuous random variables, and the resulting distribution is called a
continuous probability distribution. The function that specifies the probability
distribution of a continuous random variable is called the probability density
function (p.d.f.).
Table 14.4
100 0.3 30
110 0.6 66
120 0.1 12
Now, we will examine situations involving discrete random variables and discuss
the methods for assessing them.
14.4 DISCRETE PROBABILITY DISTRIBUTION
The binomial distribution is the basic and the most common probability distribution. It has been used to
describe a wide variety of processes in business. For example, a quality control
manager wants to know the probability of obtaining defective products in a
random sample of 10 products. If 10 per cent of the products are defective, he/she
can quickly obtain the answer from tables of the binomial probability
distribution. It is also known as the Bernoulli Distribution, as it was originated
by the Swiss mathematician James Bernoulli (1654-1705).
The binomial distribution describes discrete, not continuous, data resulting from
an experiment known as Bernoulli Process. Binomial distribution is a probability
distribution expressing the probability of one set of dichotomous alternatives, i.e.,
success or failure.
c) The trials are mutually independent i.e., the outcome of any trial is neither
affected by others nor affects others.
Assumptions i) Each trial has only two possible outcomes either Yes or No,
success or failure, etc.
ii) Regardless of how many times the experiment is performed, the probability of
the outcome, each time, remains the same.
Hence the following form of the equations, for carrying out computations of the
binomial probability is perhaps more convenient.
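$P(r) = \frac{n!}{r!\,(n-r)!}\, p^r q^{n-r}$
If n is large in number, say, $^{50}C_3$, then we can write (with the help of the
above explanation)
$^{50}C_3 = \frac{50 \times 49 \times 48}{3 \times 2 \times 1}$
Similarly,
$^{75}C_5 = \frac{75 \times 74 \times 73 \times 72 \times 71}{5 \times 4 \times 3 \times 2 \times 1}$, and so on.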
Illustration 1
A fair coin is tossed six times. What is the probability of obtaining four or more
heads?
Solution: When a fair coin is tossed, the probabilities of head and tail
are equal, i.e.,
p = q = ½ or 0.5
$P(r) = \frac{n!}{r!\,(n-r)!}\, p^r q^{n-r}$
The probability of obtaining 4 heads is:
$P(4) = \frac{6!}{4!\,(6-4)!}\,(0.5)^4 (0.5)^2 = \frac{6 \times 5 \times 4 \times 3 \times 2 \times 1}{(4 \times 3 \times 2 \times 1)(2 \times 1)}\,(0.0625)(0.25) = \frac{720}{(24)(2)}\,(0.0625)(0.25) = 15 \times 0.0625 \times 0.25 = 0.234$
The probability of obtaining 5 heads is:
$P(5) = {}^{6}C_5 \left(\frac{1}{2}\right)^5 \left(\frac{1}{2}\right)^{6-5} = \frac{6!}{5!\,(6-5)!}\,(0.5)^5 (0.5)^1 = 6 \times (0.03125)(0.5) = 0.094$
The probability of obtaining 6 heads is:
$P(6) = {}^{6}C_6 \left(\frac{1}{2}\right)^6 \left(\frac{1}{2}\right)^{6-6} = \frac{6!}{6!\,(6-6)!}\,(0.5)^6 (0.5)^0 = 1 \times 0.015625 \times 1 = 0.016$
∴ The probability of obtaining 4 or more heads is:
0.234 + 0.094 + 0.016 = 0.344
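The arithmetic in Illustration 1 can be reproduced with the binomial formula $P(r) = {}^{n}C_r\, p^r q^{n-r}$. The sketch below uses `math.comb` from the Python standard library for the combination term; the function name `binomial_pmf` is an illustrative choice.

```python
from math import comb


def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    q = 1 - p
    return comb(n, r) * p**r * q ** (n - r)


# Illustration 1: a fair coin tossed six times, four or more heads.
p4 = binomial_pmf(4, 6, 0.5)
p5 = binomial_pmf(5, 6, 0.5)
p6 = binomial_pmf(6, 6, 0.5)
p_four_or_more = p4 + p5 + p6

print(round(p4, 3), round(p5, 3), round(p6, 3), round(p_four_or_more, 3))
```

Note that $(0.5)^4 = 0.0625$, so the first term evaluates to 15 × 0.0625 × 0.25 = 0.234.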
Illustration 2
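$q = 1 - \frac{1}{5} = \frac{4}{5}$
By the binomial probability law, the probability that out of 10 workers, ‘r’ workers
suffer from a disease is given by:
$P(r) = {}^{n}C_r\, p^r q^{n-r} = {}^{10}C_r \left(\frac{1}{5}\right)^r \left(\frac{4}{5}\right)^{10-r}$; r = 0, 1, 2, …, 10
i) The required probability that exactly 2 workers will suffer from the disease is
given by:
$P(2) = {}^{10}C_2 \left(\frac{1}{5}\right)^2 \left(\frac{4}{5}\right)^{10-2}$
ii) The required probability that not more than 2 workers will suffer from the
disease is given by:
$P(0) = {}^{10}C_0 \left(\frac{1}{5}\right)^0 \left(\frac{4}{5}\right)^{10-0} = 0.107$
$P(1) = {}^{10}C_1 \left(\frac{1}{5}\right)^1 \left(\frac{4}{5}\right)^{10-1} = 0.268$
$P(2) = {}^{10}C_2 \left(\frac{1}{5}\right)^2 \left(\frac{4}{5}\right)^{10-2} = 0.302$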
We can represent the mean of the binomial distribution as:
Mean (µ) = np
where, n = Number of trials; p = probability of success
And, we can calculate the standard deviation by:
$\sigma = \sqrt{npq}$
where, n = Number of trials; p = probability of success; and q = probability of
failure = 1 – p
Illustration 3
If the probability of defective bolts is 0.1, find the mean and standard deviation
for the distribution of defective bolts in a total of 500.
∴ $\sigma = \sqrt{npq} = \sqrt{500 \times 0.1 \times 0.9} = 6.71$
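The mean and standard deviation formulas for the binomial distribution, µ = np and $\sigma = \sqrt{npq}$, are easy to evaluate directly. The sketch below assumes n = 500 for Illustration 3, which is the value used in the computation of σ above; the function name is illustrative.

```python
from math import sqrt


def binomial_mean_sd(n, p):
    """Return (mean, standard deviation) of a binomial distribution."""
    q = 1 - p
    return n * p, sqrt(n * p * q)


# Illustration 3, assuming n = 500 bolts and p = 0.1 (probability of a defect).
mean_bolts, sd_bolts = binomial_mean_sd(500, 0.1)
print(mean_bolts, round(sd_bolts, 2))
```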
i) Determine the values of ‘p’ and ‘q’. If one of these values is known, the other
can be found out by the simple relationship p = 1–q and q = 1–p. If p and q are
equal, we can say, the distribution is symmetrical. On the other hand if ‘p’ and
‘q’ are not equal, the distribution is skewed. The distribution is positively
skewed, in case ‘p’ is less than 0.5, otherwise it is negatively skewed.
ii) Expand the binomial $(p + q)^n$. The power ‘n’ is equal to one less than the
number of terms in the expanded binomial. For example, if 3 coins are tossed
(n = 3) there will be four terms, when 5 coins are tossed (n = 5) there will be 6
terms, and so on.
iii) Multiply each term of the expanded binomial by N (the total frequency), in
order to obtain the expected frequency in each category.
Let us consider an illustration for fitting a binomial distribution.
Illustration 4
Eight coins are tossed at a time 256 times. Number of heads observed at each
throw is recorded and the results are given below. Find the expected
frequencies. What are the theoretical values of mean and standard deviation?
Also calculate the mean and standard deviation of the observed frequencies.
$\sigma = \sqrt{npq} = \sqrt{8 \times \frac{1}{2} \times \frac{1}{2}} = \sqrt{2} = 1.414$
Note: The procedure for computation of mean and standard deviation of the
observed frequencies has been already discussed in Units 8 and 9 of this
course. Check these values by computing on your own.
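The fitting procedure (expand $(p + q)^n$ and multiply each term by the total frequency N) can be sketched for the setting of Illustration 4, where eight fair coins are tossed 256 times. Only the expected (theoretical) frequencies are computed here, because the observed frequencies are not reproduced in this text; the function name is an illustrative choice.

```python
from math import comb


def expected_binomial_frequencies(n, p, total):
    """Expected frequency of r = 0..n successes over `total` repetitions."""
    q = 1 - p
    return [total * comb(n, r) * p**r * q ** (n - r) for r in range(n + 1)]


# Eight fair coins (n = 8, p = 0.5) thrown N = 256 times.
expected = expected_binomial_frequencies(8, 0.5, 256)
for heads, freq in enumerate(expected):
    print(heads, freq)
```

With p = q = 0.5 the expected frequencies are symmetrical around r = 4, which matches the remark that the distribution is symmetrical when p and q are equal.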
3) The following data shows the result of the experiment of throwing 5 coins at a
time 3,100 times and the number of heads appearing in each throw. Find the
expected frequencies and comment on the results. Also calculate mean and
standard deviation of the theoretical values.
No. of heads: 0 1 2 3 4 5
frequency: 32 225 710 1,085 820 228
This would comparatively be simpler in dealing with and is given by the Poisson
distribution formula as follows:
$p(r) = \frac{m^r e^{-m}}{r!}$,
where, p (r) = Probability of successes desired
c) It consists of a single parameter “m” only. So, the entire distribution can be
obtained by knowing this value only.
In Poisson distribution, the mean (m) and the variance ($\sigma^2$) represent the same
value, i.e.,
Mean = variance = np = m
Since, n is large and p is small, the poisson distribution is applicable. Apply the
formula:
$p(r) = \frac{m^r e^{-m}}{r!}$
$p(5) = \frac{m^5 e^{-m}}{5!}$, where m = np = 200 × 0.02 = 4;
e = 2.7183 (constant)
∴ $P(5) = \frac{4^5 \times 2.7183^{-4}}{5 \times 4 \times 3 \times 2 \times 1} = \frac{(1024)\left(\frac{1}{2.7183^4}\right)}{120} = \frac{(1024)(0.0183)}{120} = 0.156$
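The Poisson formula $p(r) = \frac{m^r e^{-m}}{r!}$ can be evaluated directly to check the worked value of P(5). This is a minimal sketch; the function name `poisson_pmf` is illustrative, and `math.exp` is used in place of the rounded constant e = 2.7183.

```python
from math import exp, factorial


def poisson_pmf(r, m):
    """Probability of exactly r occurrences when the mean is m."""
    return m**r * exp(-m) / factorial(r)


# m = np = 200 * 0.02 = 4, probability of exactly 5 occurrences.
m = 200 * 0.02
p5 = poisson_pmf(5, m)
print(round(p5, 3))
```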
Illustration 6
$P(4) = \frac{m^4 e^{-m}}{4!}$, where, m = np = 30 (0.02) = 0.6
e = 2.7183 (constant)
Illustration 7
We can write $P(r) = \frac{0.439^r \times e^{-0.439}}{r!}$. Substituting r = 0, 1, 2, 3, and 4, we get
the probabilities for various values of r, as shown below:
$P_0 = \frac{m^r e^{-m}}{r!} = \frac{0.439^0 \times 2.7183^{-0.439}}{0!} = \frac{1\,(0.6443)}{1} = 0.6443$
$N(P_0) = P_0 \times N = 0.6443 \times 330 = 212.62$
Thus, the expected frequencies as per Poisson distribution are:
No. of defects (x)                          0        1       2       3     4
Expected frequencies (No. of units) (f)     212.62   93.34   20.49   3.0   0.33
Note: We can use Appendix Table-2, given at the end of this block, to
determine poisson probabilities quickly.
3) Four hundred car air-conditioners are inspected as they come off the
production line and the number of defects per set is recorded below. Find
the expected frequencies by assuming the poisson model.
No. of defects : 0 1 2 3 4 5
14.5 CONTINUOUS PROBABILITY DISTRIBUTION
In the previous sections, we have examined situations involving discrete random
variables and the resulting probability distributions. Let us now consider a
situation, where the variable of interest may take any value within a given
range. Suppose that we are planning to release water for hydropower
generation and irrigation. Depending on how much water we have in the
reservoir, viz., whether it is above or below the ‘normal’ level, we decide on
the quantity of water and time of its release. The variable indicating the
difference between the actual level and the normal level of water in the
reservoir, can take positive or negative values, integer or otherwise. Moreover,
this value is contingent upon the inflow to the reservoir, which in turn is
uncertain. This type of random variable which can take an infinite number of
values is called a continuous random variable, and the probability distribution
of such a variable is called a continuous probability distribution.
Now we present one important probability density function (p.d.f), viz., the
normal distribution.
The normal distribution is the most versatile of all the continuous probability
distributions. It is useful in statistical inferences, in characterising uncertainties
in many real life situations, and in approximating other probability distributions.
As stated earlier, the normal distribution is suitable for dealing with variables
whose magnitudes are continuous. Many statistical data concerning business
problems are displayed in the form of normal distribution. Height, weight and
dimensions of a product are some of the continuous random variables which are
found to be normally distributed. This knowledge helps us in calculating the
probability of different events in varied situations, which in turn is useful for
decision-making.
Now we turn to examine the characteristics of normal distribution with the help
of the figure 14.1, and explain the methods of calculating the probability of
different events using the distribution.
[Figure 14.1: Normal probability distribution. The mean, median and mode coincide at the center; the curve is symmetrical around a vertical line erected at the mean; the left-hand and right-hand tails extend indefinitely but never reach the horizontal axis.]
1) The curve has a single peak, thus it is unimodal i.e., it has only one mode and
has a bell shape.
3) The two tails of the normal probability distribution extend indefinitely but never
touch the horizontal axis.
Irrespective of the value of mean (µ) and standard deviation (σ), for a normal
distribution, the total area under the curve is 1.00. The area under the normal
curve is approximately distributed by its standard deviation as follows:
µ±1σ covers 68% area, i.e., 34.13% area will lie on either side of µ.
$Z = \frac{X - \mu}{\sigma}$
Where,
Step 2: Look up the probability of z value from the Appendix Table-3, given at
the end of this block, of normal curve areas. This Table is set up to
provide the area under the curve to any specified value of Z. (The
area under the normal curve is equal to 1. The curve is also called
the standard probability curve).
Let us consider the following illustration to understand as to how the table
should be consulted in order to find the area under the normal curve.
Illustration 8
(a) Find the area under the normal curve for Z = 1.54.
Solution: Consulting the Appendix Table-3 given at the end of this block, we
find that for the entry corresponding to Z = 1.54 the area is 0.4382, and this measures
the shaded area between Z = 0 and Z = 1.54 as shown in the following figure.
[Figure: normal curve with the area 0.4382 shaded between µ and Z = 1.54]
Solution: Since the curve is symmetrical, we can obtain the area between z =
–1.46 and Z = 0 by considering the area corresponding to Z = 1.46. Hence,
when we look at Z of 1.46 in Appendix Table-3 given at the end of this block,
we see the probability value of 0.4279. This value is also the probability value
of Z = –1.46 which must be shaded on the left of the µ as shown in the
following figure.
[Figure: normal curve with the area 0.4279 shaded between Z = –1.46 and µ]
[Figure: normal curve marking Z = 0.25, with areas 0.0987 and 0.4013]
d) Find the area to the left of Z = 1.83.
Solution: If we are interested in finding the area to the left of Z (positive
value), we add 0.5000 to the table value given for Z. Here, the table value for
Z (1.83) = 0.4664. Therefore, the total area to the left of Z = 0.9664 (0.5000 +
0.4664) i.e., equal to the shaded area as shown below:
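[Figure: normal curve with the area to the left of Z = 1.83 shaded: 0.5000 to the left of µ plus 0.4664 between µ and Z = 1.83]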
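The table look-ups in Illustration 8 can be reproduced with the standard normal cumulative distribution function. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper names are illustrative. The normal-area table gives the area between the mean and Z, which equals the cumulative probability minus 0.5 for positive Z.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def area_mean_to_z(z):
    """Area between the mean (Z = 0) and z, as tabulated in a normal-area table."""
    return abs(standard_normal.cdf(z) - 0.5)


def area_left_of(z):
    """Total area under the curve to the left of z."""
    return standard_normal.cdf(z)


print(round(area_mean_to_z(1.54), 4))   # part (a)
print(round(area_mean_to_z(-1.46), 4))  # symmetric case, same as Z = 1.46
print(round(area_left_of(1.83), 4))     # part (d): 0.5000 + table value
```

Because the curve is symmetrical, `area_mean_to_z` returns the same value for Z and –Z, which is the property used when looking up a negative Z in the table.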
Solution: $Z = \frac{X - \mu}{\sigma}$
X = 72 inches; µ = 68.22 inches; and $\sigma = \sqrt{10.8} = 3.286$
∴ $Z = \frac{72 - 68.22}{3.286} = 1.15$
[Figure: normal curve with µ = 68.22 and X = 72 marked; area 0.3749 between them and area 0.1251 to the right of 72]
Area to the right of the ordinate at 1.15 from the normal table is (0.5 – 0.3749)
= 0.1251. Hence, the probability of getting soldiers above six feet is 0.1251 and
out of 1,000 soldiers, the expectation is 1,000 × 0.1251 = 125.1 or 125. Thus,
the expected number of soldiers over six feet tall is 125.
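The soldier-height calculation can also be done without standardizing by hand, by defining the normal distribution with the given mean and standard deviation. This is a sketch under the values stated in the solution (µ = 68.22 inches, variance 10.8, 1,000 soldiers); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Heights in inches: mean 68.22, variance 10.8 (so sigma = sqrt(10.8)).
heights = NormalDist(mu=68.22, sigma=sqrt(10.8))

z = (72 - heights.mean) / heights.stdev       # standardized value of 72 inches
p_over_six_feet = 1.0 - heights.cdf(72)       # area to the right of 72 inches
expected_soldiers = 1000 * p_over_six_feet    # expected count among 1,000

print(round(z, 2), round(p_over_six_feet, 4), round(expected_soldiers))
```

Small differences from the table-based figure can arise because the table value is read at Z rounded to two decimal places.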
Illustration 10
(a) 15,000 students appeared for an examination. The mean marks were 49 and the
standard deviation of marks was 6. Assuming the marks to be normally
distributed, what proportion of students scored more than 55 marks?
Solution: $Z = \frac{X - \mu}{\sigma}$
X = 55; µ = 49; σ = 6
∴ $Z = \frac{55 - 49}{6} = 1$
For Z = 1, the area is 0.3413 (as per Appendix Table-3).
(b) If in the same examination, Grade ‘A’ is to be given to students scoring more
than 70 marks, what proportion of students will receive grade ‘A’?
Solution: $Z = \frac{X - \mu}{\sigma}$
X = 70; µ = 49; σ = 6
∴ $Z = \frac{70 - 49}{6} = 3.5$
The table gives the area under the standard normal curve corresponding to
Z = 3.5 as 0.4998.
Illustration 11
In a training programme (self-administered) to develop marketing skills of marketing
personnel of a company, the participants indicate that the mean time on the
programme is 500 hours and that this normally distributed random variable has a
standard deviation of 100 hours. Find out the probability that a participant selected
at random will take:
i) fewer than 570 hours to complete the programme, and
ii) between 430 and 580 hours to complete the programme.
Solution: (i) To get the Z value for the probability that a candidate selected at
random will take fewer than 570 hours, we have
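$Z = \frac{x - \mu}{\sigma} = \frac{570 - 500}{100} = \frac{70}{100} = 0.7$
For Z = 0.7, the table value is 0.2580, so P(less than 570) = 0.5000 + 0.2580 = 0.7580.
[Figure: normal curve with the area to the left of 570 (Z = 0.7) shaded]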
Thus, the probability of a participant taking less than 570 hours to complete the
programme, is marginally higher than 75 per cent.
ii) In order to get the probability, of a participant chosen at random, that he will take
between 430 and 580 hours to complete the programme, we must, first, compute
the Z value for 430 and 580 hours.
$Z = \frac{x - \mu}{\sigma}$
Z for 430 = $\frac{430 - 500}{100} = \frac{-70}{100} = -0.7$
Z for 580 = $\frac{580 - 500}{100} = \frac{80}{100} = 0.8$
The table shows that the probability values for Z values of –0.7 and 0.8 are 0.2580
and 0.2881 respectively. This situation is shown in the following figure.
[Figure: normal curve with the area between Z = –0.7 and Z = 0.8 shaded]
Thus, the probability that the random variable lies between 430 and 580 hours is
0.5461 (0.2580 + 0.2881).
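Both parts of Illustration 11 amount to differences of cumulative probabilities, which the sketch below computes for a normal distribution with mean 500 hours and standard deviation 100 hours. The helper name `prob_between` is an illustrative choice. Values computed this way may differ from the table-based figures in the fourth decimal place, since the table entries are themselves rounded.

```python
from statistics import NormalDist

# Time on the programme: mean 500 hours, standard deviation 100 hours.
hours = NormalDist(mu=500, sigma=100)


def prob_between(dist, low, high):
    """Probability that the variable lies between low and high."""
    return dist.cdf(high) - dist.cdf(low)


p_under_570 = hours.cdf(570)                    # part (i)
p_430_to_580 = prob_between(hours, 430, 580)    # part (ii)

print(round(p_under_570, 4), round(p_430_to_580, 4))
```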
(iii) To fit sampling distribution of various statistics like mean or variance etc.
Probability: Any numerical value between 0 and 1 both inclusive, telling about
the likelihood of occurrence of an event.
Probability Distribution: A curve that shows all the values that the random
variable can take and the likelihood that each will occur.
3) Define a binomial probability distribution. State the conditions under which the
binomial probability model is appropriate by illustrations.
If we have two events A and B, and we are given the conditional probability of A given B, denoted
P(A|B), we can use Bayes’ Theorem to find P(B|A), the conditional probability of B given A.
Bayes’ Theorem: $P(B|A) = \frac{P(A|B)P(B)}{P(A|B)P(B) + P(A|B')P(B')}$
Example:
Q: In a factory there are two machines manufacturing bolts. The first machine manufactures 75%
of the bolts and the second machine manufactures the remaining 25%. From the first machine 5%
of the bolts are defective and from the second machine 8% of the bolts are defective. A bolt is
selected at random. What is the probability that the bolt came from the first machine, given that it is defective?
A:
Let A be the event that a bolt is defective and let B be the event that a bolt came from Machine 1.
Check that you can see where these probabilities come from!
P(B) = 0.75    P(B') = 0.25    P(A|B) = 0.05    P(A|B') = 0.08
Now, use Bayes’ Theorem to find the required probability:
$P(B|A) = \frac{P(A|B)P(B)}{P(A|B)P(B) + P(A|B')P(B')} = \frac{0.05 \times 0.75}{0.05 \times 0.75 + 0.08 \times 0.25} = \frac{0.0375}{0.0575} = 0.6522$
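The Bayes' Theorem calculation for the bolt example can be checked numerically. The sketch below is a direct transcription of the formula; the function and parameter names are illustrative. Evaluating the expression 0.05 × 0.75 / (0.05 × 0.75 + 0.08 × 0.25) gives 0.0375 / 0.0575, which is approximately 0.6522.

```python
def bayes_posterior(prior_b, p_a_given_b, p_a_given_not_b):
    """P(B|A) from P(B), P(A|B) and P(A|B'), using Bayes' Theorem."""
    numerator = p_a_given_b * prior_b
    denominator = numerator + p_a_given_not_b * (1 - prior_b)
    return numerator / denominator


# B: bolt came from Machine 1; A: bolt is defective.
posterior = bayes_posterior(prior_b=0.75, p_a_given_b=0.05, p_a_given_not_b=0.08)
print(round(posterior, 4))
```

The same function can be applied to other two-group problems by substituting the appropriate prior and conditional probabilities.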
Try this:
Exercise: Among a group of male pensioners, 10% are smokers and 90% are nonsmokers. The probability
of a smoker dying in the next year is 0.05 while the probability for a nonsmoker is 0.005. Given
one of these pensioners dies in the next year, what is the probability that he is a smoker?