QA Chapter 1 Updated 1
Functions of Statistics
Collection of data
Tabulation of data
Analysis of data
Interpretation of results
Course Code Course Title Credit
CSDL06013 Quantitative Analysis 3
Module Content Hrs
1 Introduction to Statistics 6
Functions – Importance – Uses and Limitations of Statistics. Statistical data–Classification, Tabulation,
Diagrammatic & Graphic representation of data
2 Data Collection & Sampling Methods 5
Primary & Secondary data, Sources of data, Methods of collecting data. Sampling – Census & Sample methods
–Methods of sampling, Probability Sampling and Non-Probability Sampling.
3 Introduction to Regression 8
Mathematical and Statistical Equation – Meaning of Intercept and Slope – Error term – Measure for Model Fit
–R2 – MAE – MAPE
4 Introduction to Multiple Linear Regression 8
Multiple Linear Regression Model, Partial Regression Coefficients, Testing Overall Significance of the
Fit of the Model, Testing for Individual Regression Coefficients
5 Statistical inference 6
Random sample - Parametric point estimation: unbiasedness and consistency - method of moments and method
of maximum likelihood.
6 Tests of hypotheses 5
Null and Alternative hypotheses. Types of errors. Neyman-Pearson lemma-MP and UMP tests.
Textbooks:
1 Agarwal, B.L. (2006):-Basic Statistics. Wiley Eastern Ltd., New Delhi
2 Gupta, S. P. (2011):-Statistical Methods. Sultan Chand & Sons, New Delhi
3 Sivathanupillai, M. & Rajagopal, K. R. (1979):-Statistics for Economics Students.
4 Hogg, R. V. and Craig, A. T. (2006), An Introduction to Mathematical Statistics, Amerind Publications.
5 Gupta, S. C. and Kapoor, V. K. (2003), Fundamentals of Mathematical Statistics, Sultan Chand & Company, New Delhi
What is Data?
Definition: Facts or figures, which are numerical or
otherwise, collected with a definite purpose are
called data.
Qualitative and Quantitative Data
Qualitative Data: They represent some characteristics or attributes.
They depict descriptions that may be observed but cannot be computed
or calculated.
For example, data on attributes such as intelligence, honesty, wisdom,
cleanliness, and creativity collected from the students of your class.
Manifold Classification:- The classification in which two or more attributes are considered
and several classes are formed is called manifold classification. First, the universe/population is
divided into two classes on the basis of one attribute; then each class is further divided into
two sub-classes on the basis of the second attribute. If a third attribute is also to be considered, each
sub-class is further classified into two sub-classes.
Modes of Classification
Classification according to variables or Quantitative classification
Geographical classification
When data are classified on the basis of location or areas, it is called geographical
classification.
This type of classification is based on geographical or locational differences between
various items in the data, such as states, cities, regions, zones, etc. For example, the yield of
agricultural output per hectare for different countries in some given period may be
presented as follows:
2.0-2.4 |||| 5
2.4-2.8 |||| 5
3.2-3.6 |||| 4
3.6-4.0 |||| 4
4.0-4.4 ||| 3
CUMULATIVE FREQUENCY DISTRIBUTION
• Cumulative frequency of a class-interval can be obtained by adding the frequency of
that class-interval to the sum of the frequencies of the preceding class-intervals.
• There are two types of cumulative frequencies
(1) less than (or, from below) cumulative frequency, and
(2) more than (or, from above) cumulative frequencies.
Total 25
Distribution of home prices (1 unit is Lakh).
Classes | Frequency | Less than type Cumulative Frequency | More than type Cumulative Frequency
Total 25
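Both cumulative types can be computed with running sums. The following is a minimal Python sketch, assuming the class frequencies 2, 3, 5, 7, 6, 2 (total 25) from the relative frequency table in this chapter; the function names are illustrative only.

```python
from itertools import accumulate

classes = ["30-40", "40-50", "50-60", "60-70", "70-80", "80-90"]
freq = [2, 3, 5, 7, 6, 2]  # frequencies taken from the relative frequency table


def less_than_cumulative(frequencies):
    """Running total from the first class downwards (from below)."""
    return list(accumulate(frequencies))


def more_than_cumulative(frequencies):
    """Running total from the last class upwards (from above)."""
    return list(accumulate(reversed(frequencies)))[::-1]


less_than = less_than_cumulative(freq)
more_than = more_than_cumulative(freq)

print("Classes  f  Less than  More than")
for c, f, lt, mt in zip(classes, freq, less_than, more_than):
    print(f"{c:<8}{f:>2}{lt:>11}{mt:>11}")
```

The last "less than" value and the first "more than" value both equal the total frequency, which is a quick check on the table.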
Relative frequency
A relative frequency distribution consists of the relative frequencies, or proportions
(percentages), of observations belonging to each category.
Relative frequencies have a useful interpretation: They give the chance or probability
of getting an observation from each category in a blind or random draw.
Classes   Frequency   Relative frequency (%)
30-40     2           8
40-50     3           12
50-60     5           20
60-70     7           28
70-80     6           24
80-90     2           8
Total     25          100
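Each relative frequency is the class frequency divided by the total frequency. A minimal Python sketch, assuming the frequencies from the table above (variable names are illustrative):

```python
classes = ["30-40", "40-50", "50-60", "60-70", "70-80", "80-90"]
freq = [2, 3, 5, 7, 6, 2]

total = sum(freq)
relative = [f / total for f in freq]       # proportions, sum to 1
percent = [100 * f / total for f in freq]  # percentages, sum to 100

for c, f, p in zip(classes, freq, percent):
    print(f"{c:<8}{f:>3}{p:>7.0f}")
print(f"{'Total':<8}{total:>3}{sum(percent):>7.0f}")
```

Read as a probability, a value drawn at random from these 25 observations falls in the class 60-70 with chance 7/25 = 0.28.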
Two-way Frequency Distribution (Bivariate)
A frequency table where two variables have been measured in the same
set of items through cross classification is known as bivariate
frequency distribution or two-way frequency distribution.
For example, marks obtained by students on two subjects, ages of
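husbands and wives, weights and heights of students, etc.
Bivariate Frequency Distribution
The following data represent the marks in
X→      5-15   15-25   25-35   35-45   45-55   55-65   Total
7-19
19-31
31-43
43-55
55-67
67-79
Total
Class width by Sturges' rule: K = 1 + 3.322 log N and i = (78 − 7)/K, giving K = 6.6 and i = 10.69.
These values correspond to N = 50; with N = 25 the rule gives K = 5.6 and i = 12.58.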
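The number of classes K and the class width i quoted above follow Sturges' rule. A minimal Python sketch, assuming base-10 logarithms and a range of 78 − 7; the sample size N = 50 is an assumption chosen because it reproduces the quoted values, and the function name is illustrative:

```python
import math


def sturges(n, smallest, largest):
    """Return (number of classes K, class width i) by Sturges' rule."""
    k = 1 + 3.322 * math.log10(n)
    width = (largest - smallest) / k
    return k, width


# N = 50 is an assumption: it reproduces K = 6.6 and i = 10.69.
k50, i50 = sturges(50, 7, 78)
print(f"N=50: K={k50:.1f}, i={i50:.2f}")

# For comparison, the same rule with N = 25.
k25, i25 = sturges(25, 7, 78)
print(f"N=25: K={k25:.1f}, i={i25:.2f}")
```

In practice both K and i are rounded to convenient values before the class intervals are written down.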
PRESENTATION OF STATISTICAL DATA
Statistical data can be presented in three different ways:
Textual presentation
Tabular presentation, and
Graphical presentation.
Textual presentation
Textual presentation: This is a descriptive form.
Example
Presentation of data about deaths from industrial diseases in Great Britain in
1935–39 and 1940–44.
Example. Numerical data with regard to industrial diseases and deaths
therefrom in Great Britain during the years 1935–39 and 1940–44 are given
in a descriptive form: “During the quinquennium 1935–39, there were in
Great Britain 1,775 cases of industrial diseases made up of 677 cases of lead
poisoning, 111 of other poisoning, 144 of anthrax, and 843 of gassing. The
number of deaths reported was 20 p.c. of the cases for all the four diseases
taken together, that for lead poisoning was 135, for other poisoning 25 and
that for anthrax was 30. During the next quinquennium, 1940–44, the total
number of cases reported was 2,807. But lead poisoning cases reported fell
by 351 and anthrax cases by 35. Other poisoning cases increased by 784
between the two periods. The number of deaths reported decreased by 45
for lead poisoning, but decreased only by 2 for anthrax from the pre-war to
the post-war quinquennium. In the later period, 52 deaths were reported for
poisoning other than lead poisoning. The total number of deaths reported in
1940–44 including those from gassing was 64 greater than in 1935–39.”
The disadvantages of textual presentation are:
it is too lengthy;
there is repetition of words;
comparisons cannot be made easily;
it is difficult to get an idea and take appropriate action.
Tabular presentation, or, Tabulation
Tabulation may be defined as the systematic presentation of numerical data in
rows and/or columns according to certain characteristics.
It expresses the data in concise and attractive form which can be easily
understood and used to compare numerical figures.
Objectives of Tabulation
The main objectives of tabulation are stated below:
(i) to carry out investigation;
(ii) to do comparison;
(iii) to locate omissions and errors in the data;
(iv) to use space economically;
(v) to study the trend;
(vi) to simplify data;
(vii) to use it as future reference.
Simple Table: In a simple table (also known as a one-way table), data are presented based on only one characteristic.
Complex Tables: In a complex table (also known as a manifold table), data are presented according to two or more characteristics simultaneously. Complex tables are two-way or three-way tables according to whether two or three characteristics are presented simultaneously.
a. Double or Two-Way Table
b. Three-Way Table
c. Manifold (or Higher Order) Table
Table 3.8 Marks of Students
Marks                30-40   40-50   50-60   60-70   70-80
Number of Students   14      26      30      20      10
Kinds of Table
a. Double or Two-Way Table: In such a table, the variable under study is subdivided into
groups according to two interrelated characteristics. The two-way table is shown in Table 3.9.
There are two characteristics, namely, the marks secured by the students in the test and the
gender of the students, and the table provides information on both at once.
b. Three-Way Table: In such a table, the variable under study is divided according to three interrelated
characteristics. Table 3.10 is an example of a three-way table with
three factors, namely, marks, gender and location.
c. Manifold (or Higher Order) Table: Such tables provide information about a large number of
interrelated characteristics in the data set. A manifold (or higher order) table is shown in Table 1.4.
Table 1.1 Deaths from industrial diseases in Great Britain
Date:
1935-39 1940-44
Lead poisoning
Anthrax
Gassing
Other poisoning
Total
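Every cell of such a table can be derived from the figures in the descriptive passage. The following Python sketch shows the arithmetic, assuming the table records both cases and deaths for each period; the variable names are illustrative and every number is either quoted from the passage or computed from it.

```python
# 1935-39: cases are stated directly; deaths are 20 p.c. of all cases.
cases_1935 = {"Lead poisoning": 677, "Anthrax": 144,
              "Gassing": 843, "Other poisoning": 111}
total_cases_1935 = sum(cases_1935.values())
total_deaths_1935 = total_cases_1935 * 20 // 100
deaths_1935 = {"Lead poisoning": 135, "Anthrax": 30, "Other poisoning": 25}
# Gassing deaths are the remainder of the total.
deaths_1935["Gassing"] = total_deaths_1935 - sum(deaths_1935.values())

# 1940-44: cases are given as changes from the earlier period.
total_cases_1940 = 2807
cases_1940 = {
    "Lead poisoning": cases_1935["Lead poisoning"] - 351,
    "Anthrax": cases_1935["Anthrax"] - 35,
    "Other poisoning": cases_1935["Other poisoning"] + 784,
}
cases_1940["Gassing"] = total_cases_1940 - sum(cases_1940.values())

# 1940-44: total deaths are 64 more than in 1935-39.
total_deaths_1940 = total_deaths_1935 + 64
deaths_1940 = {
    "Lead poisoning": deaths_1935["Lead poisoning"] - 45,
    "Anthrax": deaths_1935["Anthrax"] - 2,
    "Other poisoning": 52,
}
deaths_1940["Gassing"] = total_deaths_1940 - sum(deaths_1940.values())

print(f"{'Disease':<17}{'Cases':>7}{'Deaths':>7}{'Cases':>7}{'Deaths':>7}")
for d in ["Lead poisoning", "Anthrax", "Gassing", "Other poisoning"]:
    print(f"{d:<17}{cases_1935[d]:>7}{deaths_1935[d]:>7}"
          f"{cases_1940[d]:>7}{deaths_1940[d]:>7}")
print(f"{'Total':<17}{total_cases_1935:>7}{total_deaths_1935:>7}"
      f"{total_cases_1940:>7}{total_deaths_1940:>7}")
```

The gassing figures are never stated in the passage; they only appear once the totals are laid out in a table, which illustrates why tabulation helps locate omissions.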
Present the following information in a suitable tabular form:
(i) In 2010, out of a total of 2,000 workers in a factory, 1,550 were members of a trade union.
The number of women employees was 250, out of which 200 did not belong to any
trade union.
(ii) In 2017, the number of union workers was 1,725, of which 1,600 were men. The number
of non-union workers was 380, among which 155 were women.
Year 2017
Total no. of trade union workers = 1725
Total no. of trade union men workers = 1600
Total no. of trade union women workers = (1725 − 1600) = 125
Total no. of non-trade union workers = 380
Total no. of non-trade union women workers = 155
Total no. of non-trade union men workers = (380 − 155) = 225
Total no. of workers in the factory = 1725 + 380 = 2105
Total no. of men workers = 1600 + 225 = 1825
Total no. of women workers = 125 + 155 = 280
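For both years the task reduces to completing a two-way table (union membership by gender) from the figures given. A minimal Python sketch using only the numbers stated in the problem; the function and key names are illustrative:

```python
def two_way_table(union_men, union_women, non_union_men, non_union_women):
    """Return the cells and margins of a union-by-gender table."""
    return {
        "union": {"men": union_men, "women": union_women,
                  "total": union_men + union_women},
        "non_union": {"men": non_union_men, "women": non_union_women,
                      "total": non_union_men + non_union_women},
        "total": {"men": union_men + non_union_men,
                  "women": union_women + non_union_women,
                  "total": union_men + union_women
                           + non_union_men + non_union_women},
    }


# 2010: 2,000 workers, 1,550 in the union, 250 women of whom 200 non-union.
union_women_2010 = 250 - 200                 # women who are union members
union_men_2010 = 1550 - union_women_2010     # remaining union members
non_union_men_2010 = (2000 - 1550) - 200     # non-union workers minus women
table_2010 = two_way_table(union_men_2010, union_women_2010,
                           non_union_men_2010, 200)

# 2017: 1,725 union workers (1,600 men), 380 non-union workers (155 women).
table_2017 = two_way_table(1600, 1725 - 1600, 380 - 155, 155)

for year, table in (("2010", table_2010), ("2017", table_2017)):
    print(year)
    for row, cells in table.items():
        print(f"  {row:<10}{cells['men']:>6}{cells['women']:>6}{cells['total']:>6}")
```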
The advantages of a tabular presentation are:
it is concise;
there is no repetition of explanatory matter;
comparisons can be made easily;
the important features can be highlighted; and
errors in the data can be detected.
Example
Draw up a blank table to show the number of employees in a large
commercial firm, classified according to (i) Sex: Male and Female; (ii) Three
age-groups: below 30, 30 and above but below 45, 45 and above; and (iii)
Four income-groups: below Rs. 400, Rs. 400–750, Rs. 750–1,000, above Rs.
1,000.
Graphical Presentation
Graphical Presentation: We look for the overall pattern and for striking
deviations from that pattern. The overall pattern is usually described by the shape,
center, and spread of the data. An individual value that falls outside the
overall pattern is called an outlier.
Bar diagrams and pie charts are used for categorical variables.
Histograms, stem-and-leaf plots and box plots are used for numerical
variables.
Histogram
A histogram is a graphical display of data using bars of different
heights. In a histogram, each bar groups numbers into ranges.
Taller bars show that more data falls in that range.
A histogram displays the shape and spread of continuous
sample data.
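The idea that taller bars mean more data in a range can be shown with a text histogram. A minimal Python sketch, assuming the marks distribution of Table 3.8; the function name and the scale of one symbol per two students are illustrative choices:

```python
classes = ["30-40", "40-50", "50-60", "60-70", "70-80"]
students = [14, 26, 30, 20, 10]  # frequencies from Table 3.8


def text_histogram(labels, counts, unit=2):
    """Return one line per class; each '#' stands for `unit` observations."""
    return [f"{label} | {'#' * (count // unit)} {count}"
            for label, count in zip(labels, counts)]


lines = text_histogram(classes, students)
for line in lines:
    print(line)

# The class with the longest bar holds the most observations.
tallest = classes[students.index(max(students))]
print("Tallest bar:", tallest)
```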
Box Plotting
In a stem-and-leaf plot, for each stem, record all the leaves associated with that stem. Also note that
the leaves are in numerical order.
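The stem-and-leaf construction can be sketched in a few lines of Python. The marks below are hypothetical values made up for illustration, the tens digit is taken as the stem and the units digit as the leaf, and the function name is illustrative:

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group each value's units digit (leaf) under its tens digit (stem)."""
    plot = defaultdict(list)
    for v in sorted(values):       # sorting keeps leaves in numerical order
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)


# Hypothetical marks, not taken from the chapter.
marks = [47, 35, 52, 41, 58, 36, 44, 50, 39, 45]
plot = stem_and_leaf(marks)

for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```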
Measures of Central Tendency
In statistics, a measure of central tendency is a descriptive summary of a
data set.
It is a single value that reflects the centre of the
data distribution.
It does not provide information about individual observations
in the dataset; it summarises the dataset as a whole. Generally,
the central tendency of a dataset is described using measures
such as the mean and the median.
Mean
The mean represents the average value of the dataset.
It can be calculated as the sum of all the values in the dataset
divided by the number of values. In general, it is considered as the
arithmetic mean.
Some other measures of mean used to find the central tendency
are as follows:
Geometric Mean (nth root of the product of n numbers)
Harmonic Mean (the reciprocal of the average of the reciprocals)
Weighted Mean (where some values contribute more than others)
If all the values in the dataset are the same,
then the arithmetic, geometric and harmonic means are all
equal. If there is variability in the data, then the three means
differ.
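The relationship between the three means can be checked with Python's standard `statistics` module. A minimal sketch with two small made-up datasets (the values are illustrative, not from the chapter):

```python
from statistics import geometric_mean, harmonic_mean, mean

same = [4, 4, 4]    # no variability
varied = [2, 4, 8]  # some variability

for name, data in (("same", same), ("varied", varied)):
    am = mean(data)
    gm = geometric_mean(data)
    hm = harmonic_mean(data)
    print(f"{name}: arithmetic={am:.4f} geometric={gm:.4f} harmonic={hm:.4f}")
```

For positive data with variability, the arithmetic mean is the largest of the three and the harmonic mean the smallest.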
Arithmetic Mean
Arithmetic mean represents a number that is obtained by dividing the
sum of the elements of a set by the number of values in the set. In everyday
terms, it is the average. If a data set consists of the values
b1, b2, b3, …, bn, then the arithmetic mean B is defined as:
B = (Sum of all observations) / (Total number of observations) = (b1 + b2 + … + bn) / n
The arithmetic mean of Virat Kohli’s batting scores, also called his batting
average, is:
Sum of runs scored / Number of innings = 661/10 = 66.1
The arithmetic mean of his scores in the last 10 innings is 66.1.
Geometric Mean
The geometric mean is calculated as the N-th root of the product of all values, where N
is the number of values.
Geometric Mean = N-root(x1 * x2 * … * xN)
For example, if the data contains only two values, the square root of the product of the
two values is the geometric mean. For three values, the cube-root is used, and so on.
For example, the geometric mean of the dataset 1, 3, 9, 27, 81, 243, 729 is:
1 * 3 * 9 * 27 * 81 * 243 * 729 = 10,460,353,203
7th root of 10,460,353,203 = 27
geometric mean = 27
The geometric mean is appropriate when the data contains values with different units
of measure, e.g. some measures are heights, some are dollars, some are miles, etc.
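The worked geometric mean above (product 10,460,353,203, seventh root 27) can be verified directly. A minimal Python sketch; floating-point roots are approximate, so the result is rounded for display:

```python
import math

values = [1, 3, 9, 27, 81, 243, 729]

product = math.prod(values)                # exact integer product
geo_mean = product ** (1 / len(values))    # N-th root, N = 7

print(product)
print(round(geo_mean, 6))
```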
Example
For instance, we want to compare online ratings for two coffeeshops using two different sources.
The problem is that source 1 uses a 5-star scale & source 2 uses a 100-point scale:
Coffeeshop A
source 1 rating: 4.5
source 2 rating: 68
Coffeeshop B
source 1 rating: 3
source 2 rating: 75
If we naively take the arithmetic mean of raw ratings for each coffeeshop:
Coffeeshop A = (4.5 + 68) ÷ 2 = 36.25
Coffeeshop B = (3 + 75) ÷ 2 = 39
We’d conclude that Coffeeshop B was the winner.
If we were a bit more number-savvy, we’d know that we have to normalize our values onto the same scale
before averaging them with the arithmetic mean, to get an accurate result.
So we multiply the source 1 ratings by 20 to bring them from a 5-star scale to the 100-point scale of
source 2:
Coffeeshop A
4.5 * 20 = 90
(90 + 68) ÷ 2 = 79
Coffeeshop B
3 * 20 = 60
(60 + 75) ÷ 2 = 67.5
So we find that Coffeeshop A is the true winner, contrary to the naive application of arithmetic mean
above.
The geometric mean, however, allows us to reach the same conclusion without having to fuss over the
scale or units of measure:
Coffeeshop A = square root of (4.5 * 68) = 17.5
Coffeeshop B = square root of (3 * 75) = 15
We’d conclude that Coffeeshop A was the winner.
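The three comparisons in the coffeeshop example can be reproduced in a few lines. A minimal Python sketch using the ratings quoted above; the dictionary layout and helper name are illustrative:

```python
import math

# (source 1 rating on a 5-star scale, source 2 rating on a 100-point scale)
ratings = {"Coffeeshop A": (4.5, 68), "Coffeeshop B": (3, 75)}

naive = {shop: (s1 + s2) / 2 for shop, (s1, s2) in ratings.items()}
normalized = {shop: (s1 * 20 + s2) / 2 for shop, (s1, s2) in ratings.items()}
geometric = {shop: math.sqrt(s1 * s2) for shop, (s1, s2) in ratings.items()}


def winner(scores):
    """Return the shop with the highest score."""
    return max(scores, key=scores.get)


for label, scores in (("naive arithmetic", naive),
                      ("normalized arithmetic", normalized),
                      ("geometric", geometric)):
    shown = {shop: round(score, 2) for shop, score in scores.items()}
    print(f"{label}: {shown} -> {winner(scores)}")
```

Rescaling one source multiplies both shops' geometric means by the same factor, so the ranking from the geometric mean does not depend on the scale.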
Harmonic Mean
The harmonic mean is calculated as the number of values N divided by the
sum of the reciprocal of the values (1 over each value).
Harmonic Mean = N / (1/x1 + 1/x2 + … + 1/xN)
If there are just two values (x1 and x2), a simplified calculation of the
harmonic mean can be calculated as:
Harmonic Mean = (2 * x1 * x2) / (x1 + x2)
For data involving rates and ratios (e.g. speed, acceleration, frequency), the harmonic
mean gives the most correct value of the mean.
For example,
if a vehicle travels a specified distance at speed x (e.g. 60 km/h) and then
travels the same distance again at speed y (e.g. 40 km/h),
the average speed is the harmonic mean of x and y:
(2 * 60 * 40) / (60 + 40) = 48 km/h.
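The speed example can be checked by computing the average speed from first principles (total distance over total time) and comparing it with the harmonic mean. A minimal Python sketch; the leg length of 120 km is an arbitrary assumption, since the result does not depend on it:

```python
from statistics import harmonic_mean

x, y = 60, 40   # speeds in km/h on the two legs
leg = 120       # assumed length of each leg in km (any value works)

total_distance = 2 * leg
total_time = leg / x + leg / y              # hours spent on both legs
average_speed = total_distance / total_time

two_value_formula = (2 * x * y) / (x + y)

print(average_speed, two_value_formula, harmonic_mean([x, y]))
```

The arithmetic mean of 60 and 40 would give 50 km/h, which overstates the true average because more time is spent on the slower leg.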
How to Choose the Correct Mean?
Class interval (CI)   Frequency (f)   Midpoint (m)   f × m
40-49 6 44.5 267
50-59 8 54.5 436
60-69 12 64.5 774
70-79 14 74.5 1043
80-89 7 84.5 591.5
90-99 3 94.5 283.5
Total 50 3395
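The table above is the working for the mean of grouped data: each class midpoint m is weighted by its frequency f, and the mean is the sum of f × m divided by the sum of f. A minimal Python sketch using the table's values (variable names are illustrative):

```python
intervals = [(40, 49), (50, 59), (60, 69), (70, 79), (80, 89), (90, 99)]
freq = [6, 8, 12, 14, 7, 3]

midpoints = [(low + high) / 2 for low, high in intervals]
fm = [f * m for f, m in zip(freq, midpoints)]

grouped_mean = sum(fm) / sum(freq)

for (low, high), f, m, prod in zip(intervals, freq, midpoints, fm):
    print(f"{low}-{high}  {f:>3}  {m:>5}  {prod:>7}")
print(f"Total  {sum(freq):>3}         {sum(fm):>7}")
print("Mean =", grouped_mean)
```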
Median
A quantitative analyst is
the rocket scientist of the finance world.
Quantitative analysts apply a blend of techniques and knowledge
from multiple disciplines, including
finance, economics, mathematics, statistics
and computer science.
Quantitative analysts earn an average of ₹17 lakhs per year, mostly ranging from ₹14 lakhs
to ₹23 lakhs per year.
There is a high demand for quantitative analysts, but very low supply.