Lecture_01_Math4453(CE) (1)

Uploaded by

Rajkumaar Irrfan
Copyright
© © All Rights Reserved

Statistics and Probability

Lecture_01

Measures of central tendency

Learning Objectives

After careful study of this lecture, students will be able to do the following:

1. State the basic concept of statistics with its importance and applications.

2. Summarize the collected raw data.

3. Calculate the different central values for sample data.

4. Investigate the symmetry of the collected data by using central values.

5. Identify the functions of statistics in engineering data analysis.

Introduction:
Probability and statistics are concerned with events which occur by chance. Examples include
occurrence of accidents, errors of measurements, production of defective and non-defective
items from a production line, and various games of chance, such as drawing a card from a well-
mixed deck, flipping a coin, or throwing a symmetrical six-sided die. In each case we may have
some knowledge of the likelihood of various possible results, but we cannot predict with any
certainty the outcome of any particular trial. Probability and statistics are used throughout
engineering. In electrical engineering, signals and noise are analyzed by means of probability
theory. Civil, mechanical, and industrial engineers use statistics and probability to test and
account for variations in materials and goods. Chemical engineers use probability and statistics
to assess experimental data and control and improve chemical processes. It is essential for
today’s engineer to master these tools.

Some Important Terms

(a) Probability is an area of study which involves predicting the relative likelihood of various
outcomes. It is a mathematical area which has developed over the past three or four centuries.
One of the early uses was to calculate the odds of various gambling games. Its usefulness for
describing errors of scientific and engineering measurements was soon realized. Engineers
study probability for its many practical uses, ranging from quality control and quality assurance
to communication theory in electrical engineering. Engineering measurements are often
analyzed using statistics, as we shall see later in this book, and a good knowledge of probability
is needed in order to understand statistics.

(b) Statistics is a word with a variety of meanings. To the man in the street it most often means
simply a collection of numbers, such as the number of people living in a country or city, a stock
exchange index, or the rate of inflation. These all come under the heading of descriptive
statistics, in which items are counted or measured and the results are combined in various ways
to give useful results. That type of statistics certainly has its uses in engineering, and we will
deal with it later, but another type of statistics will engage our attention in this book to a much
greater extent. That is inferential statistics or statistical inference. For example, it is often not
practical to measure all the items produced by a process. Instead, we very frequently take a
sample and measure the relevant quantity on each member of the sample. We infer something
about all the items of interest from our knowledge of the sample. A particular characteristic of all
the items we are interested in constitutes a population. Measurements of the diameter of all
possible bolts as they come off a production process would make up a particular population. A
sample is a chosen part of the population in question, say the measured diameters of twelve
bolts chosen to be representative of all the bolts made under certain conditions. We need to
know how reliable the information inferred about the population is, given that it rests on our
measurements of the sample. Perhaps we can say that “nineteen times out of twenty” the error
will be less than a certain stated limit.

(c) Chance is a necessary part of any process to be described by probability or statistics.
Sometimes that element of chance is due partly or even perhaps entirely to our lack of
knowledge of the details of the process. For example, if we had complete knowledge of the
composition of every part of the raw materials used to make bolts, and of the physical processes
and conditions in their manufacture, in principle we could predict the diameter of each bolt. But
in practice we generally lack that complete knowledge, so the diameter of the next bolt to be
produced is an unknown quantity described by a random variation. Under these conditions the
distribution of diameters can be described by probability and statistics. If we want to improve the
quality of those bolts and to make them more uniform, we will have to look into the causes of the
variation and make changes in the raw materials or the production process. But even after that,
there will very likely be a random variation in diameter that can be described statistically.

Relations which involve chance are called probabilistic or stochastic relations. These are
contrasted with deterministic relations, in which there is no element of chance. For example,
Ohm’s Law and Newton’s Second Law involve no element of chance, so they are deterministic.
However, measurements based on either of these laws do involve elements of chance, so
relations between the measured quantities are probabilistic.

(d) Another term which requires some discussion is randomness. A random action cannot be
predicted and so is due to chance. A random sample is one in which every member of the
population has an equal likelihood of appearing. Just which items appear in the sample is
determined completely by chance. If some items are more likely to appear in the sample than
others, then the sample is not random.
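As a minimal sketch of the definition above, a simple random sample can be drawn with Python's standard `random` module. The population of 500 bolt labels and the seed are hypothetical, chosen only to make the illustration repeatable.

```python
import random

# Hypothetical population: labels for 500 bolts from one production run.
population = [f"bolt-{i:03d}" for i in range(500)]

# A seeded generator is used only so that the sketch is repeatable.
rng = random.Random(42)

# sample() draws 12 distinct items without replacement; every bolt has
# the same chance of being chosen, which is what makes the sample random.
sample = rng.sample(population, 12)

print(sample)
```

If some bolts were deliberately favoured, for example by always taking the first twelve off the line, the sample would no longer be random in the sense defined above.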

Characteristics of Statistics

Some of the important characteristics of statistics are given below:

• Statistics are aggregates of facts.
• Statistics are numerically expressed.
• Statistics are affected to a marked extent by multiplicity of causes.
• Statistics are enumerated or estimated according to a reasonable standard of accuracy.
• Statistics are collected for a predetermined purpose.
• Statistics are collected in a systematic manner.
• Statistics must be comparable to each other.

What is the use of statistics?

• Statistics helps in providing a better understanding and exact description of a
phenomenon of nature.
• Statistics helps in the proper and efficient planning of a statistical inquiry in any field of
study.
• Statistics helps in collecting appropriate quantitative data.

Why is statistics important in everyday life?

The following are some notable reasons why statistics is important:

1) Everybody watches weather forecasts. Have you ever wondered where that information comes
from? The forecasts come from computer models built on statistical concepts.

2) Statistics is widely used by researchers. They use their statistical skills to collect the relevant
data; without those skills, a study can waste money, time and data.

3) What do you understand by insurance? Almost everybody has some kind of insurance, whether it is
medical, home or any other insurance. Based on an individual's application, insurers use
statistical models to calculate the risk of providing cover.

4) Statistics also plays a great role in financial markets. It is key to how traders and
businesspeople invest and make money.

5) Statistics plays a big role in the medical field. Before any drug is prescribed, scientists must show
a statistically valid rate of effectiveness. Statistics underlies medical studies.

6) Statistical concepts are used in quality testing. Companies make many products on a daily
basis, and every company should make sure that it sells items of good quality. But a company
cannot test every product, so it tests a statistical sample.

7) In everyday life we make many predictions. For example, we set the alarm for the morning
even though we cannot be certain that we will be alive in the morning. Here we use basic statistics to
make predictions.

8) Doctors predict disease on the basis of statistical concepts. Suppose a survey shows that 75% to
80% of the people surveyed have cancer, but the cause cannot be found. Once statistics is applied,
you can get a better idea of how the cancer may affect the body, or of whether smoking is the
major reason for it.

9) News reporters predict the winners of elections on the basis of political campaigns. Here
statistics plays a strong part in forecasting who will form the government.

10) Statistical data allow us to collect information from around the world. The internet is a device
which helps us to collect that information. The fundamentals of the internet are based on
statistical and mathematical concepts.

Types of Data and Data Collection

There are two types of data: primary and secondary.

Primary data

Primary data, as the name suggests, are first-hand information collected by the surveyor. The data so collected
are pure and original and collected for a specific purpose. They have never undergone any
statistical treatment before. The collected data may be published as well. Data gathered in a survey are an
example of primary data.

Methods of primary data collection:

Personal investigation: The surveyor collects the data himself/herself. The data so collected are
reliable, but the method is suited only to small projects.

Collection via Investigators: Trained investigators are employed to contact the respondents to
collect data.

Questionnaires: Questionnaires may be used to ask specific questions that suit the study and
get responses from the respondents. These questionnaires may be mailed as well.

Telephonic Investigation: The data are collected by asking questions over the
telephone, which gives quick and accurate information.

Secondary data

Secondary data are the opposite of primary data. They have already been collected and published (by some
organization, for instance). Surveyors can use them as a source of data
for their own analysis. Secondary data are impure in the sense that they
have undergone statistical treatment at least once.

Methods of secondary data collection:

1) Official publications, such as those of the Ministry of Finance, Statistical Departments of the
government, Federal Bureaus, Agricultural Statistical boards, etc. Semi-official sources
include State Bank, Boards of Economic Enquiry, etc.
2) Data published by Chambers of Commerce and trade associations and boards.
3) Articles in newspapers, journals and technical publications.

The Engineering Method and Statistical Thinking:

The steps in the engineering method are as follows:


1. Develop a clear and concise description of the problem.
2. Identify, at least tentatively, the important factors that affect this problem or that may play a
role in its solution.
3. Propose a model for the problem, using scientific or engineering knowledge of the
phenomenon being studied. State any limitations or assumptions of the model.
4. Conduct appropriate experiments and collect data to test or validate the tentative model or
conclusions made in steps 2 and 3.
5. Refine the model on the basis of the observed data.
6. Manipulate the model to assist in developing a solution to the problem.
7. Conduct an appropriate experiment to confirm that the proposed solution to the problem is
both effective and efficient.
8. Draw conclusions or make recommendations based on the problem solution.
The steps in the engineering method are shown in Fig. 1-1.

The Creative Process

Figure-1-1: The Engineering Method

Summarizing data

What is summarization of data?

Summarization is a key data mining concept which involves techniques for finding a compact
description of a dataset. Simple summarization methods such as tabulating the mean and
standard deviations are often applied for exploratory data analysis, data visualization and
automated report generation.

How do you summarize data in statistics?


To summarize a set of numbers adequately, you need to present summary statistics, starting with the averages:

Averages
Mean: the arithmetic mean, the sum of the values divided by the number of values.
Median: the middle value when all the numbers are ranked in order.
Mode: the most frequent value(s) in a sample.

Why is it important to summarize data?

The ability to summarize data provides an important method for getting a grip on the meaning of
the content of a large collection of data. It helps people understand their environment
in a form that can be put to use later.

The concept of frequency distribution is introduced as a tabular method of summarizing data.

In statistics, a frequency distribution is a list, table or graph that displays the frequency of various
outcomes in a sample. Each entry in the table contains the frequency or count of the occurrences
of values within a particular group or interval.

Frequency distribution of the marks of 30 students

Raw marks:

55 48 47 53 48 33 32 42 55 44
38 60 65 71 80 41 53 47 48 55
20 31 34 42 51 35 35 55 26 25

Frequency Table/distribution

Class interval   Tally marks    Frequency   Cumulative frequency
10-20            -              0           0
20-30            |||            3           3
30-40            ||||| ||       7           10
40-50            ||||| ||||     9           19
50-60            ||||| ||       7           26
60-70            ||             2           28
70-80            ||             2           30

Each class includes its lower limit and excludes its upper limit, except the last class, which also includes the mark 80.

Cumulative frequency: Cumulative frequencies (c.f.) are derived by accumulating the
frequencies of successive values. The cumulative frequency of a given value or class is
the total frequency of all preceding values or classes together with the frequency of that value or class itself.
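The tabulation can be checked mechanically. The sketch below counts the 30 raw marks into the classes 10-20 through 70-80 and accumulates the frequencies. The function name `frequency_table` and the convention that each class includes its lower limit, with the last class also including its upper limit (so that the mark 80 is counted), are assumptions made for this illustration.

```python
marks = [55, 48, 47, 53, 48, 33, 32, 42, 55, 44,
         38, 60, 65, 71, 80, 41, 53, 47, 48, 55,
         20, 31, 34, 42, 51, 35, 35, 55, 26, 25]

# Class boundaries 10, 20, ..., 80 give the classes 10-20, ..., 70-80.
edges = list(range(10, 90, 10))
classes = list(zip(edges[:-1], edges[1:]))


def frequency_table(values, classes):
    """Return (lower, upper, frequency, cumulative frequency) per class."""
    rows = []
    cumulative = 0
    last = len(classes) - 1
    for k, (lo, hi) in enumerate(classes):
        # Lower limit included, upper limit excluded; the last class
        # is closed on the right so that the maximum mark is counted.
        count = sum(1 for v in values
                    if lo <= v < hi or (k == last and v == hi))
        cumulative += count
        rows.append((lo, hi, count, cumulative))
    return rows


table = frequency_table(marks, classes)
for lo, hi, f, cf in table:
    print(f"{lo}-{hi}: frequency {f}, cumulative {cf}")
```

Counting directly from the raw marks in this way gives 9 marks in the class 40-50 and 7 in the class 50-60, with a final cumulative frequency of 30.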

Measures of Central Tendency


These measures indicate where most values in a distribution fall and are also referred to as the
central location of a distribution. You can think of it as the tendency of data to cluster around a
middle value. In statistics, the three most common measures of central tendency are the mean,
median, and mode.
Mean (Arithmetic)
The mean (or average) is the most popular and well known measure of central tendency. It can be used
with both discrete and continuous data, although its use is most often with continuous data. The mean is
equal to the sum of all the values in the data set divided by the number of values in the data set. The
mean of a population is represented by the Greek letter μ; the mean of a sample is represented
by 𝑋̅.
An important property of the mean is that it includes every value in your data set as part of the
calculation. In addition, the mean is the only measure of central tendency where the sum of the
deviations of each value from the mean is always zero.
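
In symbols, for a sample of n values x_1, ..., x_n (the index notation is introduced here only for illustration), the definition and the zero-sum property of the deviations can be written as:

\[
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i ,
\qquad
\sum_{i=1}^{n} \left(x_i - \bar{X}\right) = \sum_{i=1}^{n} x_i - n\bar{X} = 0 .
\]

The second identity follows directly from the first, since the sum of the values equals n times the mean.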

When not to use the mean
The mean has one main disadvantage: it is particularly susceptible to the influence of outliers.
These are values that are unusual compared to the rest of the data set by being especially small
or large in numerical value. For example, consider the wages of staff at a factory below:

Staff 1 2 3 4 5 6 7 8 9 10
Salary 15k 18k 16k 14k 15k 15k 12k 17k 90k 95k

The mean salary for these ten staff is 30.7k. However, inspecting the raw data suggests that this
mean value might not be the best way to accurately reflect the typical salary of a worker, as most
workers have salaries in the 12k to 18k range. The mean is being skewed by the two large
salaries. Therefore, in this situation, we would like to have a better measure of central tendency.
As we will find out later, taking the median would be a better measure of central tendency in this
situation.
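
The effect of the two large salaries can be seen by computing both measures for the ten salaries above. This is a small sketch using Python's standard `statistics` module; the variable names are chosen for illustration.

```python
from statistics import mean, median

# Salaries of the ten staff, in thousands, as listed in the table above.
salaries_k = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

mean_salary = mean(salaries_k)      # pulled upward by the 90k and 95k salaries
median_salary = median(salaries_k)  # average of the 5th and 6th ordered values

print(f"mean   = {mean_salary}k")
print(f"median = {median_salary}k")
```

The mean comes out at 30.7k, while the median is 15.5k, which sits inside the 12k to 18k range where most of the salaries lie.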

Median

The median is the middle score for a set of data that has been arranged in order of magnitude.
The median is less affected by outliers and skewed data. In order to calculate the median,
suppose we have the data below:

65 55 89 56 35 14 56 55 87 45 92

We first need to rearrange that data into order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89 92

Our median mark is the middle mark, in this case 56. It is the middle mark
because there are 5 scores before it and 5 scores after it. This works fine when you have an odd
number of scores, but what happens when you have an even number of scores? What if you
had only 10 scores? Well, you simply have to take the middle two scores and average the result.
So, consider the example below:

65 55 89 56 35 14 56 55 87 45

We again rearrange that data into order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89

Only now we have to take the 5th and 6th scores in our data set (55 and 56) and average them to get a
median of 55.5.
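
The two cases can be sketched in a few lines of Python. The function name `median_of` is hypothetical; the data are the two score lists used above.

```python
def median_of(scores):
    """Median of a list of numbers, handling odd and even lengths."""
    ordered = sorted(scores)          # arrange in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd count: the single middle score
    # Even count: average the two middle scores.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]   # 11 scores
even_scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]      # 10 scores

print(median_of(odd_scores))
print(median_of(even_scores))
```

With 11 scores the function returns the 6th ordered score, 56; with 10 scores it averages the 5th and 6th ordered scores to give 55.5.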

Mode

The mode is the most frequent score in our data set. On a bar chart or histogram it is represented by the highest
bar. You can, therefore, sometimes consider the mode as being the
most popular option. An example of a mode is presented below:

Normally, the mode is used for categorical data where we wish to know which is the most
common category, as illustrated below:

We can see above that the most common form of transport, in this particular data set, is the bus.
However, one of the problems with the mode is that it is not always unique, so it leaves us with problems
when we have two or more values that share the highest frequency, such as below:
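
A short sketch of how the mode, and a tie for the mode, can be found with a frequency count. Both data sets below are hypothetical and are not the ones shown in the figures; the function name `modes` is also chosen for illustration.

```python
from collections import Counter


def modes(values):
    """Return every value that attains the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


# Hypothetical categorical data with a single mode.
transport = ["bus", "car", "bus", "train", "walk", "bus", "car"]

# Hypothetical scores in which two values share the highest frequency.
scores = [2, 3, 3, 5, 7, 7, 9]

print(modes(transport))
print(modes(scores))
```

For the transport data the function returns a single category, "bus". For the scores it returns both 3 and 7, which is the situation in which the mode stops being a single representative value.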

Skewed Distributions and the Mean and Median

We often test whether our data is normally distributed because this is a common assumption
underlying many statistical tests. An example of a normally distributed set of data is presented
below:

When you have a normally distributed sample you can logically use both the mean and the
median as your measure of central tendency. In fact, in any symmetrical distribution with a single peak the mean,
median and mode are equal. However, in this situation, the mean is widely preferred as the best
measure of central tendency because it is the measure that includes all the values in the data
set for its calculation, and any change in any of the scores will affect the value of the mean. This
is not the case with the median or mode.

However, when our data is skewed, for example, as with the right-skewed data set below:

We find that the mean is being pulled in the direction of the skew. In these situations, the median
is generally considered to be the best representative of the central location of the data. The more
skewed the distribution, the greater the difference between the median and mean, and the
greater emphasis should be placed on using the median as opposed to the mean. A classic
example of the above right-skewed distribution is income (salary), where higher-earners provide
a false representation of the typical income if expressed as a mean and not a median.

Mean

The direct method of computation involves heavy calculations; to avoid these, the
following formulae are generally used.

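(a) Short-Cut method

\[
\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}
\]

Let $d_i = x_i - A$, so that $\sum f_i x_i = \sum f_i (A + d_i)$. Then

\[
\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}
        = \frac{\sum_{i=1}^{n} f_i (A + d_i)}{\sum_{i=1}^{n} f_i}
        = A + \frac{\sum_{i=1}^{n} f_i d_i}{\sum_{i=1}^{n} f_i}
\]

(b) Step-deviation method

\[
u_i = \frac{d_i}{h}; \qquad d_i = u_i h
\]

\[
\bar{x} = A + \frac{\sum_{i=1}^{n} f_i d_i}{\sum_{i=1}^{n} f_i}
        = A + h \, \frac{\sum_{i=1}^{n} f_i u_i}{\sum_{i=1}^{n} f_i}
\]

Where $u_i = \frac{x_i - A}{h}$, A being an arbitrary origin and h the interval of the class.
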
Example-1. The vibration time of a pendulum gives the following readings:
50.4, 50.2, 50.7, 49.8, 50.1, 50.3, 49.8, 50, 49.9, 50.3, 49.6
Calculate the mean time of vibration.
Solution: Take arbitrary origin A = 50. Each reading is counted once (f_i = 1), so the table of deviations is

Time (xi)      50.4  50.2  50.7  49.8  50.1  50.3  49.8  50    49.9  50.3  49.6
di = xi - A     0.4   0.2   0.7  -0.2   0.1   0.3  -0.2   0    -0.1   0.3  -0.4

Mean by short-cut method:

\[
\bar{x} = A + \frac{\sum_{i=1}^{n} f_i d_i}{\sum_{i=1}^{n} f_i} = 50 + \frac{1.1}{11} = 50.1
\]

Example -2. The following is the frequency distribution of a random sample of weekly earnings
of 509 employees:

Weekly earnings   No. of Employees   Weekly earnings   No. of Employees
10                3                  26                79
12                6                  28                55
14                10                 30                36
16                15                 32                26
18                24                 34                19
20                42                 36                13
22                75                 38                9
24                90                 40                7

Calculate the average weekly earnings.

The step-deviation method requires an equal class interval h. In problems with grouped
frequencies the arbitrary origin is generally taken as the value
corresponding to the maximum frequency.
Solution:
The calculation is set out in the following table, which treats each listed earning as the lower limit of a
class of width h = 2 and takes the arbitrary origin A = 25 at the class of maximum frequency.

Weekly earning   Mid-value (xi)   No. of Employees (fi)   fi.xi   ui   fi.ui
10-12            11               3                       33      -7   -21
12-14            13               6                       78      -6   -36
14-16            15               10                      150     -5   -50
16-18            17               15                      255     -4   -60
18-20            19               24                      456     -3   -72
20-22            21               42                      882     -2   -84
22-24            23               75                      1725    -1   -75
24-26            25               90                      2250     0     0
26-28            27               79                      2133     1    79
28-30            29               55                      1595     2   110
30-32            31               36                      1116     3   108
32-34            33               26                      858      4   104
34-36            35               19                      665      5    95
36-38            37               13                      481      6    78
38-40            39               9                       351      7    63
40-42            41               7                       287      8    56
Total                             ∑ 𝑓𝑖 = 509              ∑ 𝑓𝑖 𝑥𝑖 = 13315   ∑ 𝑓𝑖 𝑢𝑖 = 295

Mean by the direct method:

\[
\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} = \frac{13315}{509} = 26.16
\]

By step-deviation method:

\[
\bar{x} = A + h \, \frac{\sum_{i=1}^{n} f_i u_i}{\sum_{i=1}^{n} f_i} = 25 + 2 \times \frac{295}{509} = 26.16
\]
Median

For grouped data the definition of the median is not as straightforward. It is the value which divides the
total frequency into two equal parts.

For grouped data

\[
\text{Median} = L + \frac{N/2 - C}{f} \times h
\]

Where L = Lower limit of the class in which the median lies,
N = Total frequency
f = Frequency of the median class
h = class interval of the median class
C = Cumulative frequency up to the class preceding the median class.

Mode
Defined as the value of the variable which occurs most frequently, i.e. the value with the maximum frequency.
For grouped data

\[
\text{Mode} = L + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times h
\]

Where L = Lower limit of the class containing the mode
$\Delta_1$ = Excess of modal frequency over frequency of the preceding class.
$\Delta_2$ = Excess of modal frequency over frequency of the following class
h = Size of modal class

Quartiles
Quartiles are those values which divide the total frequency into four equal parts, when the values are
arranged in ascending order of magnitude. The lower and upper quartiles, for grouped
data, are calculated as follows.

\[
Q_1 = L + \frac{N/4 - C}{f} \times h
\]

\[
Q_3 = L + \frac{3N/4 - C}{f} \times h
\]

Where L = Lower limit of the class in which the lower/upper quartile lies
N = Total frequency
f = Frequency of the lower/upper quartile class
h = class interval of the lower/upper quartile class
C = Cumulative frequency up to the class preceding the lower/upper quartile class.

\[
\text{Semi-interquartile range} = \frac{1}{2}\left(Q_3 - Q_1\right)
\]

SOME OBSERVATIONS
(i) Of the three measures, the mean is the most important.

(ii) The median is of advantage when there are exceptionally large and small values at the end
of distributions.

(iii) Mode is misleading in distributions which are small in numbers or highly symmetrical.

(iv) In a symmetrical distribution mean, median and mode coincide.

(v) For a moderately non-symmetrical distribution they differ and are connected approximately by the empirical relation

Mean - Mode = 3 (Mean - Median)

Example -3. Calculate Median and the lower and upper quartiles from the following distribution
of marks obtained by 49 students in a class. Find also the semi-interquartile range and the
mode.

Marks             5-10  10-15  15-20  20-25  25-30  30-35  35-40  40-45
No. of students   5     6      15     10     5      4      2      2

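Solution:
The cumulative frequencies of the eight classes are 5, 11, 26, 36, 41, 45, 47 and 49.

Median (which divides the total frequency into two equal parts, i.e. 49/2 = 24.5) falls in the group (15-20)
and is given by

\[
\text{Median} = L + \frac{N/2 - C}{f} \times h = 15 + \frac{49/2 - 11}{15} \times 5 = 19.5 \text{ Marks}
\]

Lower quartile Q1 (49/4 = 12.25) falls in the group (15-20) and is given by

\[
Q_1 = L + \frac{N/4 - C}{f} \times h = 15 + \frac{49/4 - 11}{15} \times 5 = 15.4 \text{ Marks}
\]

Upper quartile Q3 (3x49/4 = 36.75) falls in the group (25-30) and is given by

\[
Q_3 = L + \frac{3N/4 - C}{f} \times h = 25 + \frac{36.75 - 36}{5} \times 5 = 25.75 \text{ Marks}
\]

\[
\text{Semi-interquartile range} = \frac{1}{2}\left(Q_3 - Q_1\right) = \frac{1}{2}\left(25.75 - 15.4\right) = 5.175
\]

It is seen that the mode falls in the class 15-20, which has the highest frequency. Therefore, for grouped data,

\[
\text{Mode} = L + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times h = 15 + \frac{15 - 6}{(15 - 6) + (15 - 10)} \times 5 = 18.2 \text{ Marks}
\]
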
Exercise-1
1) Write a brief note on the purpose of classification of statistical data. The following are the
weights in kilograms of a group of 55 students.

42 74 40 60 82 115 41 61 75 83 63
53 110 76 84 50 67 78 77 63 65 95
68 69 104 80 79 79 54 73 59 81 100
66 49 77 90 84 76 42 64 69 70 80
72 50 79 52 103 96 51 86 78 94 71
Prepare a frequency table taking the magnitude of each class interval as ten kilograms and the
first class interval as equal to 40 and less than 50. Calculate the arithmetic mean.

[x=73.55]

2) (i) The mean of 20 numbers is 50. By mistake one number is taken as 54 instead of 45. Find
the correct mean.

(ii) From the following data prepare a frequency distribution with class interval "10" of which
one class must be 30-40 and calculate mode and median.

4 51 7 52 8 53 24 78 65 47
12 69 17 68 13 67 37 18 79 58
21 75 29 76 26 77 44 36 46 39
29 82 31 84 35 86 54 45 56 57
41 93 42 97 43 15 66 55 57 59
[Median = 50, Mode=53.75]

3) The following figures show the number of passengers carried on each of the 70 journeys by
an aircraft with a seating capacity of 100. Calculate:
(i) The average capacity used.
(ii) If 63 passengers is the smallest profitable load, the proportion of flights which were
unprofitable:

11 12 27 57 90 72 51 76 81 71
78 76 61 25 48 67 80 78 66 32
52 26 70 53 27 67 88 67 23 96
59 74 87 61 57 24 60 63 51 52
18 43 76 99 76 64 87 12 89 38
28 87 79 90 58 29 51 45 29 84
37 82 30 76 58 33 81 55 68 91
[Average capacity used 58.7%; Flights unprofitable 51.43%]

4) A liquid is sold in bottles of a normal capacity of one liter. A sample of 200 bottles is
selected at random and the capacity of each determined. The following table shows the result
in grouped frequency distribution of the capacity of bottles:

Capacity (c.c)    No. of bottles   Capacity (c.c)    No. of bottles
995.0-995.5       49               998.0-998.5       14
995.5-996.0       16               998.5-1001.0      13
996.0-996.5       15               1001.0-1002.5     11
996.5-997.0       14               1002.5-1005.0     14
997.0-997.5       10               1005.0-1010.0     8
997.5-998.0       12               1010.0-1030.0     14
Total             116              Total             84
(i) Calculate the arithmetic mean of the capacity distribution.

(ii) What are the relative advantages and disadvantages of the mean, the median and the mode of
this distribution as measures of "average" capacity?

5) Find the mean, median and modal age of married women at first child birth

Age            13  14   15   16   17   18   19   20   21  22  23  24  25
No. of women   37  162  343  390  256  433  161  355  65  85  49  46  40
(Answer: mean = 17.175; median = 18; mode = 18)

6) The frequency distribution of weight in grams of mangoes of a given variety is given below.
Calculate the arithmetic mean, median and mode.

Weight (in gms)   410-419  420-429  430-439  440-449  450-459  460-469  470-479
No. of mangoes    14       20       42       54       45       18       7
(Answer: mean = 443.4; median = 443.94; mode = 445.21)

7) Calculate the average weight (in lbs.) from the following data:

Weight above     100  110  120  130  140  150
No. of Persons   400  304  170  100  80   50
(Answer: average weight = 122.5 lbs.)
