Lecture_01_Math4453(CE) (1)
Lecture_01
Learning Objectives
After careful study of this lecture, students will be able to do the following:
1. State the basic concept of statistics with its importance and applications.
Introduction:
Probability and statistics are concerned with events which occur by chance. Examples include
occurrence of accidents, errors of measurements, production of defective and non-defective
items from a production line, and various games of chance, such as drawing a card from a well-
mixed deck, flipping a coin, or throwing a symmetrical six-sided die. In each case we may have
some knowledge of the likelihood of various possible results, but we cannot predict with any
certainty the outcome of any particular trial. Probability and statistics are used throughout
engineering. In electrical engineering, signals and noise are analyzed by means of probability
theory. Civil, mechanical, and industrial engineers use statistics and probability to test and
account for variations in materials and goods. Chemical engineers use probability and statistics
to assess experimental data and control and improve chemical processes. It is essential for
today’s engineer to master these tools.
(a) Probability is an area of study which involves predicting the relative likelihood of various
outcomes. It is a mathematical area which has developed over the past three or four centuries.
One of the early uses was to calculate the odds of various gambling games. Its usefulness for
describing errors of scientific and engineering measurements was soon realized. Engineers
study probability for its many practical uses, ranging from quality control and quality assurance
to communication theory in electrical engineering. Engineering measurements are often
analyzed using statistics, as we shall see later in this book, and a good knowledge of probability
is needed in order to understand statistics.
(b) Statistics is a word with a variety of meanings. To the man in the street it most often means
simply a collection of numbers, such as the number of people living in a country or city, a stock
exchange index, or the rate of inflation. These all come under the heading of descriptive
statistics, in which items are counted or measured and the results are combined in various ways
to give useful results. That type of statistics certainly has its uses in engineering, and we will
deal with it later, but another type of statistics will engage our attention in this book to a much
greater extent. That is inferential statistics or statistical inference. For example, it is often not
practical to measure all the items produced by a process. Instead, we very frequently take a
sample and measure the relevant quantity on each member of the sample. We infer something
about all the items of interest from our knowledge of the sample. A particular characteristic of all
the items we are interested in constitutes a population. Measurements of the diameter of all
possible bolts as they come off a production process would make up a particular population. A
sample is a chosen part of the population in question, say the measured diameters of twelve
bolts chosen to be representative of all the bolts made under certain conditions. We need to
know how reliable the information inferred about the population is, given that it rests only on
our measurements of the sample. Perhaps we can say that “nineteen times out of twenty” the error
will be less than a certain stated limit.
distribution of diameters can be described by probability and statistics. If we want to improve the
quality of those bolts and to make them more uniform, we will have to look into the causes of the
variation and make changes in the raw materials or the production process. But even after that,
there will very likely be a random variation in diameter that can be described statistically.
Relations which involve chance are called probabilistic or stochastic relations. These are
contrasted with deterministic relations, in which there is no element of chance. For example,
Ohm’s Law and Newton’s Second Law involve no element of chance, so they are deterministic.
However, measurements based on either of these laws do involve elements of chance, so
relations between the measured quantities are probabilistic.
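The contrast between a deterministic law and a probabilistic measurement can be illustrated with a
short sketch. It assumes, purely for illustration, that the measurement error is normally
distributed with a standard deviation of 0.05 V; the function names, the seed and the numbers are
hypothetical and not part of the lecture.

```python
import random

def ohm_voltage(current_a, resistance_ohm):
    """Deterministic relation (Ohm's Law): same inputs, same voltage, every time."""
    return current_a * resistance_ohm

def measured_voltage(current_a, resistance_ohm, noise_sd, rng):
    """Probabilistic relation: the true voltage plus a random measurement error.

    The normal error model is an assumption made for this sketch.
    """
    return ohm_voltage(current_a, resistance_ohm) + rng.gauss(0.0, noise_sd)

rng = random.Random(1)  # fixed seed so the run is repeatable
true_v = ohm_voltage(2.0, 5.0)
readings = [measured_voltage(2.0, 5.0, 0.05, rng) for _ in range(5)]

print("Law:      ", true_v)
print("Measured: ", [round(v, 3) for v in readings])
```

The law always returns the same value, while repeated measurements scatter around it; that scatter
is what probability and statistics describe.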
(d) Another term which requires some discussion is randomness. A random action cannot be
predicted and so is due to chance. A random sample is one in which every member of the
population has an equal likelihood of appearing. Just which items appear in the sample is
determined completely by chance. If some items are more likely to appear in the sample than
others, then the sample is not random.
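A simple random sample can be sketched as follows. The population of 500 numbered bolts and the
seed are hypothetical; the sample size of twelve echoes the bolt example earlier in the lecture.
`random.sample` draws without replacement, so each bolt has the same chance of being chosen.

```python
import random

# Hypothetical population: bolts labelled 1..500 as they come off the line.
population = list(range(1, 501))

rng = random.Random(42)  # fixed seed only so the example is repeatable
sample = rng.sample(population, 12)  # every bolt is equally likely to be drawn

print(sorted(sample))
```

Picking, say, only the first twelve bolts of a shift would not be random in this sense, because
later bolts would have no chance of appearing.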
Characteristics of Statistics
Statistics are affected to a marked extent by multiplicity of causes.
Statistics are enumerated or estimated according to a reasonable standard of accuracy.
Statistics are collected for a predetermined purpose.
Statistics are collected in a systematic manner.
Statistics must be comparable to each other.
What is the use of statistics?
1) Everybody watches weather forecasts. Have you ever thought about how that information is
obtained? Forecasts come from computer models built on statistical concepts.
2) Statistics are used mostly by researchers, who apply their statistical skills to collect the
relevant data. Without those skills, the result is a loss of money, time and data.
3) What do you understand by insurance? Everybody has some kind of insurance, whether it is
medical, home or any other insurance. Based on an individual application, businesses use
statistical models to calculate the risk of providing insurance.
4) Statistics also play a great role in financial markets. They are key to how traders and
businessmen invest and make money.
5) Statistics play a big role in the medical field. Before any drug is prescribed, scientists must
show a statistically valid rate of effectiveness. Statistics lie behind all medical studies.
6) Statistical concepts are used in quality testing. Companies make many products on a daily
basis, and every company should make sure that it sells the best-quality items. But companies
cannot test all the products, so they test a statistical sample.
7) In everyday life we make many predictions. For example, we set the alarm for the morning
even though we do not know whether we will be alive in the morning. Here we use the basics of
statistics to make predictions.
8) Doctors predict disease based on statistical concepts. Suppose a survey shows that 75%-80% of
people have cancer and the reason cannot be found. When statistics become involved, you can get
a better idea of how the cancer may affect your body or whether smoking is the major reason
for it.
9) News reporters predict the winners of elections based on political campaigns. Here
statistics play a strong part in predicting who will form your government.
10) Statistical data allow us to collect information from around the world. The internet is a
device which helps us to collect that information. The fundamentals behind the internet are
based on statistical and mathematical concepts.
Primary data
Primary data, as the name suggests, are first-hand information collected by the surveyor. The data
so collected are pure and original and are collected for a specific purpose. They have never
undergone any statistical treatment before. The collected data may be published as well. A survey
is an example of a source of primary data.
Personal investigation: The surveyor collects the data himself/herself. The data so collected are
reliable, but the method is suited only to small projects.
Collection via Investigators: Trained investigators are employed to contact the respondents to
collect data.
Questionnaires: Questionnaires may be used to ask specific questions that suit the study and
get responses from the respondents. These questionnaires may be mailed as well.
Telephonic Investigation: Data are collected by asking questions over the telephone, which gives
quick and accurate information.
Secondary data
Secondary data are the opposite of primary data. They have already been collected and published
(by some organization, for instance). Surveyors can use them as a source of data and conduct
their analysis on them. Secondary data are impure in the sense that they have undergone
statistical treatment at least once.
The Creative Process
Summarizing data
Summarization is a key data mining concept which involves techniques for finding a compact
description of a dataset. Simple summarization methods such as tabulating the mean and
standard deviations are often applied for exploratory data analysis, data visualization and
automated report generation.
The ability to summarize data provides an important method for getting a grip on the meaning of
the content of a large collection of data. It helps humans understand their environment
in a manner amenable to future useful manipulation.
In statistics, a frequency distribution is a list, table or graph that displays the frequency of various
outcomes in a sample. Each entry in the table contains the frequency or count of the occurrences
of values within a particular group or interval.
Frequency distribution of the marks of 30 students
55 48 47 53 48 33 32 42 55 44
38 60 65 71 80 41 53 47 48 55
20 31 34 42 51 35 35 55 26 25
Frequency Table/distribution
Cumulative frequency: Cumulative frequencies (c.f.) are derived by accumulating the
frequencies of successive values. The cumulative frequency of a given value or class is
the total frequency of all preceding values or classes, including that of the value or class itself.
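As a minimal sketch, the marks of the 30 students listed above can be grouped into classes and
accumulated as follows. The class width of 10 and the starting point of 20 are assumptions made
for illustration; a different grouping would give a different table.

```python
marks = [
    55, 48, 47, 53, 48, 33, 32, 42, 55, 44,
    38, 60, 65, 71, 80, 41, 53, 47, 48, 55,
    20, 31, 34, 42, 51, 35, 35, 55, 26, 25,
]

width = 10  # assumed class width
start = 20  # assumed lower limit of the first class

# Count how many marks fall in each class, keyed by the class lower limit.
counts = {}
for m in marks:
    lower = start + ((m - start) // width) * width
    counts[lower] = counts.get(lower, 0) + 1

# Build rows of (lower, upper, frequency, cumulative frequency).
table = []
running = 0
for lower in sorted(counts):
    running += counts[lower]
    table.append((lower, lower + width - 1, counts[lower], running))

for lower, upper, f, cf in table:
    print(f"{lower}-{upper}  f = {f:2d}  c.f. = {cf:2d}")
```

The last cumulative frequency equals the total number of observations, which is a quick check
that no mark has been missed.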
When not to use the mean
The mean has one main disadvantage: it is particularly susceptible to the influence of outliers.
These are values that are unusual compared to the rest of the data set by being especially small
or large in numerical value. For example, consider the wages of staff at a factory below:
Staff 1 2 3 4 5 6 7 8 9 10
Salary 15k 18k 16k 14k 15k 15k 12k 17k 90k 95k
The mean salary for these ten staff is 30.7k. However, inspecting the raw data suggests that this
mean value might not be the best way to accurately reflect the typical salary of a worker, as most
workers have salaries in the 12k to 18k range. The mean is being skewed by the two large
salaries. Therefore, in this situation, we would like to have a better measure of central tendency.
As we will find out later, taking the median would be a better measure of central tendency in this
situation.
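The effect of the two large salaries can be checked with a short sketch using Python's standard
`statistics` module and the ten salaries from the table above (in thousands).

```python
from statistics import mean, median

salaries_k = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]  # from the table above

mean_k = mean(salaries_k)      # pulled upward by the two outliers
median_k = median(salaries_k)  # average of the 5th and 6th ordered values

print(f"mean   = {mean_k}k")
print(f"median = {median_k}k")
```

The median lands inside the 12k to 18k range where most salaries lie, while the mean does not.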
Median
The median is the middle score for a set of data that has been arranged in order of magnitude.
The median is less affected by outliers and skewed data. In order to calculate the median,
suppose we have the data below:
65 55 89 56 35 14 56 55 87 45 92
We first need to rearrange that data into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89 92
Our median mark is the middle mark, in this case 56. It is the middle mark
because there are 5 scores before it and 5 scores after it. This works fine when you have an odd
number of scores, but what happens when you have an even number of scores? What if you
had only 10 scores? Well, you simply have to take the middle two scores and average the result.
So, if we look at the example below
65 55 89 56 35 14 56 55 87 45
and again rearrange the data into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89
Only now we have to take the 5th and 6th score in our data set and average them to get a
median of 55.5.
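The two cases (odd and even number of scores) can be written as a small function. This is a
sketch of the procedure described above; the function name `median_of` is chosen here for
illustration.

```python
def median_of(scores):
    """Return the median: the middle value, or the mean of the two middle values."""
    ordered = sorted(scores)  # arrange in order of magnitude, smallest first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: single middle score
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the middle two

eleven_scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
ten_scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]

print(median_of(eleven_scores))
print(median_of(ten_scores))
```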
Mode
The mode is the most frequent score in our data set. On a bar chart or histogram it is
represented by the highest bar. You can, therefore, sometimes consider the mode as being the
most popular option. An example of a mode is presented below:
Normally, the mode is used for categorical data where we wish to know which is the most
common category, as illustrated below:
We can see above that the most common form of transport, in this particular data set, is the bus.
However, one of the problems with the mode is that it is not unique, so it leaves us with problems
when we have two or more values that share the highest frequency, such as below:
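Both situations, a single mode and a tie, can be sketched in a few lines. The transport responses
and the tied numbers below are hypothetical data invented for illustration.

```python
from collections import Counter

def modes(values):
    """Return every value that shares the highest frequency, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

# Hypothetical survey of how people travel to work.
transport = ["bus", "car", "bus", "train", "bus", "car", "bicycle"]
# Hypothetical scores in which two values tie for the highest frequency.
tied_scores = [3, 5, 5, 7, 7, 9]

print(modes(transport))
print(modes(tied_scores))
```

When the function returns more than one value, the data set has no unique mode and another
measure of central tendency may be more informative.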
We often test whether our data is normally distributed because this is a common assumption
underlying many statistical tests. An example of a normally distributed set of data is presented
below:
When you have a normally distributed sample you can legitimately use both the mean and the
median as your measure of central tendency. In fact, in any symmetrical distribution with a
single peak the mean, median and mode are equal. However, in this situation, the mean is widely
preferred as the best measure of central tendency because it is the measure that includes all the
values in the data set for its calculation, and any change in any of the scores will affect the
value of the mean. This is not the case with the median or mode.
However, when our data is skewed, for example, as with the right-skewed data set below:
We find that the mean is being pulled in the direction of the skew. In these situations, the median
is generally considered to be the best representative of the central location of the data. The more
skewed the distribution, the greater the difference between the median and mean, and the
greater emphasis should be placed on using the median as opposed to the mean. A classic
example of the above right-skewed distribution is income (salary), where higher earners give
a false representation of the typical income if it is expressed as a mean and not a median.
Mean
The direct method of computation involves heavy calculations; in order to avoid these, the
following formulae are generally used.
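(a) Short-Cut method
$$\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}$$
Let $d_i = x_i - A$; then $f_i x_i = f_i (A + d_i)$ and
$$\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} = \frac{\sum_{i=1}^{n} f_i (A + d_i)}{\sum_{i=1}^{n} f_i} = A + \frac{\sum_{i=1}^{n} f_i d_i}{\sum_{i=1}^{n} f_i}$$
$$\bar{x} = A + \frac{\sum_{i=1}^{n} f_i d_i}{\sum_{i=1}^{n} f_i} = A + h \, \frac{\sum_{i=1}^{n} f_i u_i}{\sum_{i=1}^{n} f_i}$$
Where $u_i = \dfrac{x_i - A}{h}$, A being the arbitrary origin and h the interval of the class.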
Example-1. The vibration time of a pendulum gives the following readings:
50.4, 50.2, 50.7, 49.8, 50.1, 50.3, 49.8, 50, 49.9, 50.3, 49.6
Calculate the mean time of vibration.
Solution: Take arbitrary origin A = 50. Then the frequency distribution table is
Time (xi) 50.4 50.2 50.7 49.8 50.1 50.3 49.8 50 49.9 50.3 49.6
di = xi - A 0.4 0.2 0.7 -0.2 0.1 0.3 -0.2 0 -0.1 0.3 -0.4
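Mean by short-cut method (each reading has frequency $f_i = 1$, so $\sum f_i = 11$ and $\sum f_i d_i = 1.1$):
$$\bar{x} = A + \frac{\sum_{i=1}^{n} f_i d_i}{\sum_{i=1}^{n} f_i} = 50 + \frac{1.1}{11} = 50.1$$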
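The arithmetic of Example-1 can be checked with a short sketch that computes the mean both
directly and by the short-cut method with the same origin A = 50. The variable names are chosen
here for illustration.

```python
readings = [50.4, 50.2, 50.7, 49.8, 50.1, 50.3, 49.8, 50, 49.9, 50.3, 49.6]
A = 50  # arbitrary origin, as in the example

# Deviations d_i = x_i - A; every reading has frequency 1.
deviations = [x - A for x in readings]

shortcut_mean = A + sum(deviations) / len(readings)
direct_mean = sum(readings) / len(readings)

print(round(sum(deviations), 1))  # sum of the deviations
print(round(shortcut_mean, 2), round(direct_mean, 2))
```

Both routes give the same mean; the short-cut method only replaces large numbers by small
deviations to lighten hand calculation.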
Example -2. The following is the frequency distribution of a random sample of weekly earnings
of 509 employees
Weekly earnings No. of Employees Weekly earnings No. of Employees
10 3 26 79
12 6 28 55
14 10 30 36
16 15 32 26
18 24 34 19
20 42 36 13
22 75 38 9
24 90 40 7
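Calculate the average weekly earnings. In the step-deviation method, the equal class interval h is
involved. In problems with grouped frequencies the arbitrary origin is generally taken as the value
corresponding to the maximum frequency.
Solution:
We carry out the calculation in tabular form and take the arbitrary origin at the class with the
maximum frequency. The totals used below correspond to treating each tabulated earnings value as
the lower limit of a class of width h = 2, so that the mid-values are $x_i$ = 11, 13, ..., 41.
$$\text{Mean} = \bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} = \frac{13315}{509} = 26.16$$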
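By step-deviation method, with $A = 25$, $h = 2$ and $u_i = (x_i - A)/h$:
$$\bar{x} = A + h \, \frac{\sum_{i=1}^{n} f_i u_i}{\sum_{i=1}^{n} f_i} = 25 + 2 \times \frac{295}{509} = 26.16$$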
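The totals 13315 and 295 in Example-2 can be reproduced with the sketch below. It assumes that
each tabulated earnings value is the lower limit of a class of width 2, so the class mid-values
are 11, 13, ..., 41; this reading is inferred from the totals in the solution rather than stated
in the table.

```python
lower_limits = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40]
freqs = [3, 6, 10, 15, 24, 42, 75, 90, 79, 55, 36, 26, 19, 13, 9, 7]
h = 2  # class interval

# Assumption: tabulated values are lower class limits, so mid-values are lower + h/2.
midpoints = [lo + h / 2 for lo in lower_limits]

n = sum(freqs)
A = midpoints[freqs.index(max(freqs))]  # origin at the class of maximum frequency
u = [(x - A) / h for x in midpoints]    # step deviations

sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
sum_fu = sum(f * ui for f, ui in zip(freqs, u))

direct_mean = sum_fx / n
step_mean = A + h * sum_fu / n

print(n, A, sum_fx, sum_fu)
print(round(direct_mean, 2), round(step_mean, 2))
```

The step deviations are small whole numbers, which is why this form was preferred for hand
calculation; both formulas give the same mean.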
Median
For grouped data the definition of the median is not as straightforward. It is the value which
divides the total frequency into two equal parts.
$$\text{Median} = L + \frac{\frac{N}{2} - C}{f} \times h$$
Where L = Lower limit of the class to which the median belongs,
N = Total frequency
f = Frequency of the median class
h = class interval of the median class
C = Cumulative frequency up to the class preceding the median class.
Mode
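Defined as the value of the variable which occurs most frequently, i.e. the value with the maximum
frequency.
For grouped data
$$\text{Mode} = L + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times h$$
Where L = Lower limit of the modal class, h = class interval of the modal class,
$\Delta_1$ = excess of the modal frequency over the frequency of the preceding class, and
$\Delta_2$ = excess of the modal frequency over the frequency of the following class.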
Quartiles
Quartiles are those values which divide the total frequency into four equal parts when the values
are arranged in ascending order of magnitude. The lower and upper quartiles, for grouped
data, are calculated as follows.
$$Q_1 = L + \frac{\frac{N}{4} - C}{f} \times h$$
$$Q_3 = L + \frac{\frac{3N}{4} - C}{f} \times h$$
Where L = Lower limit of the class to which the lower/upper quartile belongs
N = Total frequency
f = Frequency of the lower/upper quartile class
h = class interval of the lower/upper quartile class
C = Cumulative frequency up to the class preceding the lower/upper quartile class.
$$\text{Semi-interquartile range} = \frac{1}{2}\,(Q_3 - Q_1)$$
SOME OBSERVATIONS
(i) Of the three measures, the mean is the most important.
(ii) The median is of advantage when there are exceptionally large and small values at the ends
of the distribution.
(iii) The mode is misleading in distributions which are small in number or highly symmetrical.
For moderately asymmetrical distributions the three measures are connected by the empirical relation
Mean - Mode = 3 (Mean - Median)
Example -3. Calculate Median and the lower and upper quartiles from the following distribution
of marks obtained by 49 students in a class. Find also the semi-interquartile range and the
mode.
Solution:
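Median (which divides the total frequency into two equal parts, i.e. 49/2 = 24.5) falls in the
group (15-20) and is given by
$$\text{Median} = L + \frac{\frac{N}{2} - C}{f} \times h = 15 + \frac{\frac{49}{2} - 11}{15} \times 5 = 19.5 \text{ Marks}$$
Lower quartile Q1 (49/4 = 12.25) falls in the group (15-20) and is given by
$$Q_1 = L + \frac{\frac{N}{4} - C}{f} \times h = 15 + \frac{\frac{49}{4} - 11}{15} \times 5 = 15.4 \text{ Marks}$$
Upper quartile Q3 (3 x 49/4 = 36.75) falls in the group (25-30) and is given by
$$Q_3 = L + \frac{\frac{3N}{4} - C}{f} \times h = 25 + \frac{36.75 - 36}{5} \times 5 = 25.75 \text{ Marks}$$
$$\text{Semi-interquartile range} = \frac{1}{2}\,(Q_3 - Q_1) = \frac{1}{2}\,(25.75 - 15.4) = 5.175$$
It is seen that the modal value falls in the class 15-20. Therefore, for grouped data,
$$\text{Mode} = L + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times h = 15 + \frac{(15-6)}{(15-6)+(15-10)} \times 5 = 18.2 \text{ Marks}$$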
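The grouped-data formulas used in Example-3 can be sketched as two small functions. The function
names are hypothetical, and the inputs are only the values quoted in the solution above (L, N, C,
f, h and the three frequencies 6, 15, 10 around the modal class).

```python
def grouped_quantile(L, N, C, f, h, fraction):
    """L + (fraction*N - C)/f * h; fraction = 0.5 gives the median, 0.25 Q1, 0.75 Q3."""
    return L + (fraction * N - C) / f * h

def grouped_mode(L, f_modal, f_prev, f_next, h):
    """L + d1/(d1 + d2) * h, with d1, d2 the excess of the modal frequency
    over the preceding and following class frequencies."""
    d1 = f_modal - f_prev
    d2 = f_modal - f_next
    return L + d1 / (d1 + d2) * h

N = 49
median = grouped_quantile(L=15, N=N, C=11, f=15, h=5, fraction=0.5)
q1 = grouped_quantile(L=15, N=N, C=11, f=15, h=5, fraction=0.25)
q3 = grouped_quantile(L=25, N=N, C=36, f=5, h=5, fraction=0.75)
semi_iqr = (q3 - q1) / 2
mode = grouped_mode(L=15, f_modal=15, f_prev=6, f_next=10, h=5)

print(round(median, 2), round(q1, 2), round(q3, 2))
print(round(semi_iqr, 3), round(mode, 1))
```

Carrying Q1 unrounded (about 15.417) gives a semi-interquartile range of about 5.167; the value
5.175 in the worked solution arises from rounding Q1 to 15.4 before subtracting.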
Exercise-1
1) Write a brief note on the purpose of classification of statistical data. The following are the
weights in kilograms of a group of 55 students.
42 74 40 60 82 115 41 61 75 83 63
53 110 76 84 50 67 78 77 63 65 95
68 69 104 80 79 79 54 73 59 81 100
66 49 77 90 84 76 42 64 69 70 80
72 50 79 52 103 96 51 86 78 94 71
Prepare a frequency table taking the magnitude of each class interval as ten kilograms and the
first class interval as equal to 40 and less than 50. Calculate the arithmetic mean.
[Mean x̄ = 73.55]
2) (i) The mean of 20 numbers is 50. By mistake one number was taken as 54 instead of 45. Find
the correct mean.
(ii) From the following data prepare a frequency distribution with class interval 10, of which
one class must be 30-40, and calculate the mode and median.
4 51 7 52 8 53 24 78 65 47
12 69 17 68 13 67 37 18 79 58
21 75 29 76 26 77 44 36 46 39
29 82 31 84 35 86 54 45 56 57
41 93 42 97 43 15 66 55 57 59
[Median = 50, Mode=53.75]
3) The following figures show the number of passengers carried on each of the 70 journeys by
an aircraft with a seating capacity of 100. Calculate:
(i) The average capacity used.
(ii) The proportion of flights which were unprofitable, if 63 passengers is the smallest
profitable load:
11 12 27 57 90 72 51 76 81 71
78 76 61 25 48 67 80 78 66 32
52 26 70 53 27 67 88 67 23 96
59 74 87 61 57 24 60 63 51 52
18 43 76 99 76 64 87 12 89 38
28 87 79 90 58 29 51 45 29 84
37 82 30 76 58 33 81 55 68 91
[Average capacity used 58.7%; Flights unprofitable 51.43%]
4) A liquid is sold in bottles of a nominal capacity of one liter. A sample of 200 bottles is
selected at random and the capacity of each determined. The following table shows the result
as a grouped frequency distribution of the capacity of the bottles:
Capacity (ml)    No. of bottles    Capacity (ml)    No. of bottles
995.0-995.5      49                998.0-998.5      14
995.5-996.0      16                998.5-1001.0     13
996.0-996.5      15                1001.0-1002.5    11
996.5-997.0      14                1002.5-1005.0    14
997.0-997.5      10                1005.0-1010.0    8
997.5-998.0      12                1010.0-1030.0    14
Total            116               Total            84
(i) Calculate the arithmetic mean of the capacity distribution.
(ii) What are the relative advantages and disadvantages of the mean, the median and the mode of
this distribution as measures of "average" capacity?
5) Find the mean, median and modal age of married women at first child birth.
Age            13   14   15   16   17   18   19   20   21   22   23   24   25
No. of women   37   162  343  390  256  433  161  355  65   85   49   46   40
(Answer: mean = 17.175; median = 18; mode = 18)
6) The frequency distribution of weight in grams of mangoes of a given variety is given below.
Calculate the arithmetic mean, median and mode.
Weight (in gms) 410-419 420-429 430-439 440-449 450-459 460-469 470-479
No. of mangoes 14 20 42 54 45 18 7
(Answer: mean = 443.4; median = 443.94; mode = 445.21)
7) Calculate the average weight (in lbs.) from the following data: