QA Chapter 1 Updated 1

The document outlines the course structure for Quantitative Analysis (CSDL06013) at Vidyalankar Institute of Technology for the academic year 2021-2022. It covers key concepts such as statistical analysis, data collection methods, regression analysis, and types of data classification. The course is designed to equip students with the necessary skills to analyze and interpret quantitative data effectively.
Vidyalankar Institute of Technology

Department of Computer Engineering


Academic Year 2021-2022
Semester: VI
Course Code Course Title Credit
CSDL06013 Quantitative Analysis 3

Faculty : Dr. Kavita P Shirsat


Why we need Quantitative Analysis (quants)
 Information Explosion
 Noisy and huge amount of data
 Traditional Models are broken

Quantitative analysis (QA)


QA is a technique that uses mathematical and statistical
modeling, measurement, and research to understand
behavior.
How does quantitative analysis work?

 Statistical analysis methods


 basic calculations (for example, averages and medians)
 More sophisticated analyses (for example, correlations and regressions).

 Functions of Statistics
 Collection of data
 Tabulation of data
 Analysis of data
 Interpretation of results
Course Code Course Title Credit
CSDL06013 Quantitative Analysis 3
Module Content Hrs
1 Introduction to Statistics 6
Functions – Importance – Uses and Limitations of Statistics. Statistical data–Classification, Tabulation,
Diagrammatic & Graphic representation of data
2 Data Collection & Sampling Methods 5
Primary & Secondary data, Sources of data, Methods of collecting data. Sampling – Census & Sample methods
–Methods of sampling, Probability Sampling and Non-Probability Sampling.
3 Introduction to Regression 8
Mathematical and Statistical Equation – Meaning of Intercept and Slope – Error term – Measure for Model Fit
–R2 – MAE – MAPE
4 Introduction to Multiple Linear Regression 8
Multiple Linear Regression Model, Partial Regression Coefficients, Testing the Overall Significance
(Overall Fit) of the Model, Testing Individual Regression Coefficients
5 Statistical inference 6
Random sample – Parametric point estimation: unbiasedness and consistency – method of moments and method
of maximum likelihood.
6 Tests of hypotheses 5
Null and Alternative hypotheses. Types of errors. Neyman-Pearson lemma-MP and UMP tests.
Textbooks:
1 Agarwal, B. L. (2006): Basic Statistics. Wiley Eastern Ltd., New Delhi.
2 Gupta, S. P. (2011): Statistical Methods. Sultan Chand & Sons, New Delhi.
3 Sivathanupillai, M. & Rajagopal, K. R. (1979): Statistics for Economics Students.
4 Hogg, R. V. and Craig, A. T. (2006): An Introduction to Mathematical Statistics. Amerind Publications.
5 Gupta, S. C. and Kapoor, V. K. (2003): Fundamentals of Mathematical Statistics. Sultan Chand & Company, New Delhi.
What is Data?
Definition: Facts or figures, which are numerical or
otherwise, collected with a definite purpose are
called data.
Qualitative and Quantitative Data
 Qualitative Data: They represent some characteristics or attributes.
They depict descriptions that may be observed but cannot be computed
or calculated.
 For example, data on attributes such as intelligence, honesty, wisdom,
cleanliness, and creativity collected from the students of your class.

 Quantitative Data: These can be measured and not simply observed.


They can be numerically represented and calculations can be
performed on them.
 For example, data on the number of students playing different sports from
your class gives an estimate of how many of the total students play which
sport.
Nominal and Ordinal Data
 Nominal data
 Nominal data is used just for labeling variables, without any type of quantitative value.
 Examples of Nominal Data:
• Gender (Women, Men)
• Hair color (Blonde, Brown, Brunette, Red, etc.)
• Marital status (Married, Single, Widowed)
 Ordinal data
 Ordinal data shows where a value stands in an order.
 Ordinal data is placed into some kind of order according to its position on a scale.
It may indicate superiority.
 We cannot do arithmetic with ordinal values because they only show sequence.
 Examples of Ordinal Data:
• When a company asks a customer to rate the sales experience on a scale of 1-10.
• Economic status: low, medium and high.
Discrete vs Continuous Data
 Discrete data
 Discrete data is a count that involves only integers. The discrete values
cannot be subdivided into parts.
 Examples of discrete data:
• The number of students in a class.
• The number of workers in a company.
• The number of test questions you answered correctly
 Continuous data
 Continuous data is information that could be meaningfully divided into
finer levels. It can be measured on a scale or continuum and can have
almost any numeric value.
 Examples of continuous data:
• The amount of time required to complete a project.
• The height of children.
• The square footage of a two-bedroom house.
Classification of Data
 The placement of data in different homogeneous groups based on some
characteristics or criteria is called classification.
 The classified data is presented in the form of well-arranged tables.
 The systematic arrangement of data in rows and/or columns is called a
table.

 Some norms for ideal classification and tabulation are:
 The classes should be complete and non-overlapping.
 Example: classifying people according to marital status.
 The classes should be clearly defined.
 Use standardized classes so that comparing results over time
becomes easy.
 The units of the classes should be the same.
Modes of Classification

 Classification according to Attributes or Qualitative classification.


 Dichotomy or Two-fold Classification: When the data are classified into two classes on the basis of
the presence or absence of an attribute, one class possessing that attribute and the other
not possessing it, it is called two-fold or dichotomous classification. For example, a population
can be divided into people with color blindness and people without color blindness.

 Manifold Classification: The classification where two or more attributes are considered
and several classes are formed is called manifold classification. First, the universe/population is
divided into two classes on the basis of one attribute; each class is then further divided into
two sub-classes on the basis of the second attribute. If a third attribute is also considered, each
sub-class is further divided into two sub-classes.
Modes of Classification
 Classification according to variables or Quantitative classification
 Geographical classification
 When data are classified on the basis of location or areas, it is called geographical
classification.
 This type of classification is based on geographical or locational differences between
various items in the data, such as states, cities, regions, zones, etc. For example, the yield of
agricultural output per hectare for different countries in some given period may be
presented as follows:

Agricultural Output of different countries (in kg per hectare)

Country       India   USA   Pakistan   Japan   China
Avg. Output   125     585   140        410     330
Modes of Classification
 chronological classification.
 When data are classified with respect to different periods of time (hour, day,
week, month, year, etc.) it is known as chronological or temporal classification.
For example, the population of India for different decades may be presented as
follows:

Population of India (in Crores)

Year         1951   1961   1971   1981   1991   2001
Population   36.1   43.9   54.7   68.5   84.4   102.7
Quantitative classification
 In this type of classification there are two elements
 variable
Variable refers to the characteristic that varies in magnitude or
quantity. E.g. weight of the students. A variable may be discrete or
continuous.
 Frequency
Frequency refers to the number of times each value of the variable is repeated.
For example, if there are 50 students weighing 60 kg, then 50
is the frequency of the value 60 kg.
Frequency distribution
 The number of occurrences of a value is termed the “frequency”
of that value.
 Frequency distribution refers to data classified on the basis of some
variable that can be measured such as prices, weight, height, wages
etc.
Frequency distribution
Grouped Frequency Distribution
• The tabulation of raw data by dividing the whole range of observations into a number of classes and
indicating the corresponding class-frequencies against the class-intervals is called a “grouped
frequency distribution”.
• Data can be further condensed by putting them into smaller groups, or classes, called
“class-intervals”. The number of items which fall in a class-interval is called its “class frequency”.

Sturges' Rule for number of classes and size of interval
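The rule itself is not reproduced in the text of this slide. As commonly stated, Sturges' rule suggests K = 1 + 3.322 × log10(N) classes for N observations, and the size of each class interval is then the range divided by K. A minimal sketch, assuming that form of the rule; the sample size, minimum and maximum below are hypothetical values chosen only for illustration:

```python
import math

def sturges_classes(n):
    """Suggested number of classes K for n observations (Sturges' rule)."""
    return 1 + 3.322 * math.log10(n)

def class_width(minimum, maximum, k):
    """Size of each class interval: the range divided by the number of classes."""
    return (maximum - minimum) / k

# Hypothetical sample: 100 observations ranging from 20 to 96.
k = sturges_classes(100)
i = class_width(20, 96, k)
print(f"K = {k:.2f}, i = {i:.2f}")
```

In practice both results are rounded to convenient whole numbers before the classes are written down.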
Frequency distribution
The following technical terms are important when a continuous frequency
distribution is formed
Class limits: Class limits are the lowest and highest values that can be
included in a class. For example, take the class 51-55. The lowest value of the
class is 51 and the highest value is 55. In this class there can be no value
less than 51 or more than 55. 51 is the lower class limit and 55 is the upper
class limit.
Class interval: The difference between the upper and lower limit of a class is
known as class interval of that class.
Class frequency: The number of observations corresponding to a particular
class is known as the frequency of that class.
Two types according to the class-intervals - (i) Exclusive Method (ii) Inclusive
Method.
Exclusive Method
 Exclusive Method : In this method the upper limit of a class becomes
the lower limit of the next class. It is called ' Exclusive ' as we do not
put any item that is equal to the upper limit of a class in the same
class; we put it in the next class.
 For example, a person of age 20 years will not be included in the
class-interval ( 10 - 20 ) but taken in the next class ( 20 - 30 ), since
in the class interval ( 10 - 20 ) only units ranging from 10 - 19 are
included. The exclusive-types of class-intervals can also be
expressed as

 0 and below 4.5, or 0 – 4.4
 4.5 and below 9.5, or 4.5 – 9.4
 9.5 and below 14.5, or 9.5 – 14.4
Inclusive Method

 Inclusive Method: In this method the upper limit of any class-interval
is kept in the same class-interval. The upper limit of a class
is 1 less than the lower limit of the next class-interval.
In short, this method allows a class-interval
to include both its lower and upper limits within it.
Example

Classes   Tally marks   No. of children (frequency)
2.0-2.4   ||||          5
2.4-2.8   ||||          5
2.8-3.2   |||| ||||     9
3.2-3.6   ||||          4
3.6-4.0   ||||          4
4.0-4.4   |||           3
CUMULATIVE FREQUENCY DISTRIBUTION
• Cumulative frequency of a class-interval can be obtained by adding the frequency of
that class-interval to the sum of the frequencies of the preceding class-intervals.
• There are two types of cumulative frequencies
(1) less than (or, from below) cumulative frequency, and
(2) more than (or, from above) cumulative frequencies.

Distribution of home prices (1 unit is Lakh).

Classes   Frequency   Less than type   Cumulative Frequency   More than type   Cumulative Frequency
30-40     2           Less than 40     2                      More than 30     25
40-50     3           Less than 50     5                      More than 40     23
50-60     5           Less than 60     10                     More than 50     20
60-70     7           Less than 70     17                     More than 60     15
70-80     6           Less than 80     23                     More than 70     8
80-90     2           Less than 90     25                     More than 80     2
Total     25
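The two cumulative columns can be reproduced with running sums. A short sketch using the home-price frequencies from the table; the variable names are illustrative only:

```python
from itertools import accumulate

classes = ["30-40", "40-50", "50-60", "60-70", "70-80", "80-90"]
freq = [2, 3, 5, 7, 6, 2]

# "Less than" type: running total from the first class downwards.
less_than = list(accumulate(freq))

# "More than" type: running total from the last class upwards.
more_than = list(accumulate(reversed(freq)))[::-1]

for c, f, lt, mt in zip(classes, freq, less_than, more_than):
    print(f"{c:6} f={f:2}  less-than cf={lt:2}  more-than cf={mt:2}")
```

The last "less than" value and the first "more than" value both equal the total frequency, which is a quick check on the arithmetic.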
Relative frequency
 A relative frequency distribution consists of the relative frequencies, or proportions
(percentages), of observations belonging to each category.
 Relative frequencies have a useful interpretation: They give the chance or probability
of getting an observation from each category in a blind or random draw.

Distribution of home prices (1 unit is Lakh).

Classes   Frequency   Relative Frequency in percentage
30-40     2           8
40-50     3           12
50-60     5           20
60-70     7           28
70-80     6           24
80-90     2           8
Total     25          100

[Pie chart: House price, share of each class: 30-40 8%, 40-50 12%, 50-60 20%, 60-70 28%, 70-80 24%, 80-90 8%]
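Each relative frequency is the class frequency divided by the total frequency, expressed here as a percentage. A minimal sketch using the same home-price data; the dictionary layout is just one convenient choice:

```python
freq = {"30-40": 2, "40-50": 3, "50-60": 5, "60-70": 7, "70-80": 6, "80-90": 2}

total = sum(freq.values())

# Relative frequency in percent = 100 * class frequency / total frequency.
relative_pct = {cls: 100 * f / total for cls, f in freq.items()}

for cls, pct in relative_pct.items():
    print(f"{cls}: {pct:.0f}%")
```

The percentages always add up to 100, so a sum that differs from 100 points to an error in the frequencies.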
Two-way Frequency Distribution (Bivariate)
 A frequency table where two variables have been measured in the same
set of items through cross classification is known as bivariate
frequency distribution or two-way frequency distribution.
For example, marks obtained by students in two subjects, ages of
husbands and wives, weights and heights of students, etc.
The following data represent the marks in Statistics (x) and Commerce (y) of 25 students.

Bivariate Frequency Distribution

y ↓ / x →   5-15   15-25   25-35   35-45   45-55   55-65   Total
7-19
19-31
31-43
43-55
55-67
67-79
Total

Classes for y by Sturges' rule: K = 1 + 3.322 × log 25 ≈ 5.64 and i = (78 − 7)/K ≈ 12.6
PRESENTATION OF STATISTICAL DATA
 Statistical data can be presented in three different ways:
 Textual presentation
 Tabular presentation, and
 Graphical presentation.
Textual presentation
 Textual presentation: This is a descriptive form.
Example
Presentation of data about deaths from industrial diseases in Great Britain in
1935–39 and 1940–44.
 Numerical data with regard to industrial diseases and deaths
therefrom in Great Britain during the years 1935–39 and 1940–44 are given
in a descriptive form: “During the quinquennium 1935–39, there were in
Great Britain 1,775 cases of industrial diseases made up of 677 cases of lead
poisoning, 111 of other poisoning, 144 of anthrax, and 843 of gassing. The
number of deaths reported was 20 p.c. of the cases for all the four diseases
taken together, that for lead poisoning was 135, for other poisoning 25 and
that for anthrax was 30. During the next quinquennium, 1940–44, the total
number of cases reported was 2,807. But lead poisoning cases reported fell
by 351 and anthrax cases by 35. Other poisoning cases increased by 784
between the two periods. The number of deaths reported decreased by 45
for lead poisoning, but decreased only by 2 for anthrax from the pre-war to
the post-war quinquennium. In the later period, 52 deaths were reported for
poisoning other than lead poisoning. The total number of deaths reported in
1940–44 including those from gassing was 64 greater than in 1935–39”
Textual presentation
 The disadvantages of textual presentation are:
 it is too lengthy
 there is repetition of words;
 comparisons cannot be made easily;
 it is difficult to get an idea and take appropriate action.
Tabular presentation, or, Tabulation
 Tabulation may be defined as the systematic presentation of numerical data in
rows or/and columns according to certain characteristics.
 It expresses the data in concise and attractive form which can be easily
understood and used to compare numerical figures.
 Objectives of Tabulation
The main objectives of tabulation are stated below:
(i) to carry out investigation;
(ii) to do comparison;
(iii) to locate omissions and errors in the data;
(iv) to use space economically;
(v) to study the trend;
(vi) to simplify data;
(vii) to use it as future reference.
Tabular presentation, or, Tabulation

An ideal statistical table should contain the following items:


(i) Table number
(ii) Title
(iii) Date
(iv) Stubs, or, Row designations:
(v) Column headings, or, Captions
(vi) Body of the table
(vii) Unit of measurement
(viii) Source
(ix) Footnotes and references
Kinds of Table
I. According to Purpose
 General Purpose Table: A general purpose table is a
table of general use. It does not serve any
specific purpose or specific problem under consideration.
 Special Purpose Table: Special Purpose table is that
table which is prepared with some specific purpose in
mind.
II. According to Originality
 Original Table: An original table is that in which data are
presented in the same form and manner in which they are
collected.
 Derived Table: A derived table is that in which data are
not presented in the form or manner in which these are
collected. Instead the data are first converted into ratios
or percentage and then presented.
Kinds of Table
III. According to Construction

 Simple Table: In a simple table (also known as a one-way table), data are presented based on only one characteristic.

 Complex Tables: In a complex table (also known as a manifold table), data are presented according to two or more characteristics simultaneously. Complex tables are two-way or three-way tables according to whether two or three characteristics are presented simultaneously.
a. Double or Two-Way Table
b. Three-Way Table
c. Manifold (or Higher Order) Table

Table 3.8
Marks of Students
Marks                30-40   40-50   50-60   60-70   70-80
Number of Students   14      26      30      20      10
Kinds of Table

a. Double or Two-Way Table: In such a table, the variable under study is further subdivided into two
groups according to two inter-related characteristics. A two-way table is shown in Table 3.9.
There are two characteristics, namely, marks secured by the students in the test and the
gender of the students. The table provides information relating to these two interrelated
characteristics.
Kinds of Table

b. Three-Way Table: In such a table, the variable under study is divided according to three interrelated
characteristics. Table 3.10 is an example of a three-way table with
three factors, namely, marks, gender and location.
Kinds of Table

c. Manifold (or Higher Order) Table: Such tables provide information about a large number of
interrelated characteristics in the data set. A manifold (or higher order) table is shown in Table 1.4.
Tabular presentation, or, Tabulation
Table 1.1 Deaths from industrial diseases in Great Britain
Date:

                  1935-39                               1940-44
Diseases          Number of cases   Number of deaths    Number of cases   Number of deaths
Lead poisoning    677               135                 326               90
Anthrax           144               30                  109               28
Gassing           843               165                 1477              249
Other poisoning   111               25                  895               52
Total             1775              355                 2807              419


 Present the following information in a suitable tabular form:
(i) In 2010, out of a total of 2,000 workers in a factory, 1,550 were members of a trade union. The
number of women employees was 250, out of which 200 did not belong to any trade
union.
(ii) In 2017, the number of union workers was 1,725, of which 1,600 were men. The number of
non-union workers was 380, among which 155 were women.
Year 2010
Total no. of workers in the factory = 2000
Total no. of trade union workers = 1550
Total no. of non-trade union workers = (2000 − 1550) = 450
Total no. of women workers = 250
Total no. of non-trade union women workers = 200
Total no. of trade union women workers = (250 − 200) = 50
Total no. of trade union men workers = (1550 − 50) = 1500
Total no. of non-trade union men workers = (450 − 200) = 250
Total no. of men workers = 1500 + 250 = 1750

Year 2017
Total no. of trade union workers = 1725
Total no. of trade union men workers = 1600
Total no. of trade union women workers = (1725 − 1600) = 125
Total no. of non-trade union workers = 380
Total no. of non-trade union women workers = 155
Total no. of non-trade union men workers = (380 − 155) = 225
Total no. of workers in the factory = 1725 + 380 = 2105
Total no. of men workers = 1600 + 225 = 1825
Total no. of women workers = 125 + 155 = 280
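The derived figures can be arranged as a two-way table (union membership by gender) for each year. The sketch below is one possible layout, not necessarily the one intended for the exercise; the helper name `build_table` is hypothetical:

```python
def build_table(union_men, union_women, nonunion_men, nonunion_women):
    """Two-way table (membership x gender) with row and column totals."""
    men = union_men + nonunion_men
    women = union_women + nonunion_women
    return {
        "Union": {"Men": union_men, "Women": union_women,
                  "Total": union_men + union_women},
        "Non-union": {"Men": nonunion_men, "Women": nonunion_women,
                      "Total": nonunion_men + nonunion_women},
        "Total": {"Men": men, "Women": women, "Total": men + women},
    }

# 2010: fill the missing cells from the totals given in the problem.
union_women_2010 = 250 - 200                       # women minus non-union women
table_2010 = build_table(1550 - union_women_2010,  # union men
                         union_women_2010,
                         (2000 - 1550) - 200,      # non-union men
                         200)

# 2017: union women and non-union men follow by subtraction.
table_2017 = build_table(1600, 1725 - 1600, 380 - 155, 155)

for year, table in (("2010", table_2010), ("2017", table_2017)):
    print(year)
    for row, cells in table.items():
        print(f"  {row:10} {cells['Men']:6} {cells['Women']:6} {cells['Total']:6}")
```

The grand totals (2000 and 2105) match the figures worked out above, which confirms that every cell is consistent with the problem statement.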
Tabular presentation, or, Tabulation
 The advantages of a tabular presentation are:
 it is concise;
 there is no repetition of explanatory matter;
 comparisons can be made easily;
 the important features can be highlighted; and
 errors in the data can be detected.
Example
Draw up a blank table to show the number of employees in a large
commercial firm, classified according to (i) Sex: Male and Female; (ii) Three
age-groups: below 30, 30 and above but below 45, 45 and above; and (iii)
Four income-groups: below Rs. 400, Rs. 400–750, Rs. 750–1,000, above Rs.
1,000.
Graphical Presentation
 Graphical Presentation: We look for the overall pattern and for striking
deviations from that pattern. The overall pattern is usually described by the shape,
center, and spread of the data. An individual value that falls outside the
overall pattern is called an outlier.

 Bar diagrams and pie charts are used for categorical variables.

 Histograms, stem and leaf plots and box plots are used for numerical
variables.
Histogram
 A histogram is a graphical display of data using bars of different
heights. In a histogram, each bar groups numbers into ranges.
Taller bars show that more data falls in that range.
A histogram displays the shape and spread of continuous
sample data
Box Plotting

 Box plots (also called box-and-whisker plots or box-whisker plots)
give a good graphical image of the concentration of the data.
 They also show how far the extreme values are from
most of the data.
 A box plot is constructed from five values: the
minimum value, the first quartile, the median, the third
quartile, and the maximum value.
Box Plotting

A boxplot is a standardized way of displaying the
distribution of data based on a five-number summary (“minimum”, first quartile (Q1),
median, third quartile (Q3), and “maximum”). It can tell you about your outliers and
what their values are. It can also tell you if your data is symmetrical, how tightly your
data is grouped, and if and how your data is skewed.
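The five values behind a box plot can be computed directly. A minimal sketch using Python's standard `statistics` module; the sample data is hypothetical, quartile conventions differ between textbooks and tools (the "inclusive" method is used here), and the 1.5 × IQR rule for flagging outliers is a common convention assumed for illustration, not something stated in the slides:

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

def outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3 (assumed convention)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

data = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # hypothetical sample
summary = five_number_summary(data)
print("five-number summary:", summary)
print("outliers:", outliers(data))
```

Here the box spans Q1 to Q3, and the value 30 lies far beyond the upper fence, so it would be drawn as an individual point rather than as part of the whisker.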
Stem and leaf plots
 Stem and leaf plots display the shape and spread of a continuous data
distribution.
 These graphs are similar to histograms, but instead of using bars, they show
digits.
 It’s a particularly valuable tool during exploratory data analysis. They can help you
identify the central tendency, variability, skewness of your distribution, and
outliers.
 Stem and leaf plots are also known as stemplots.
 Each data point is split into a stem and leaf value.
 The stem values divide the data points into groups.
 The stem value contains all the digits of a data point except the final number, which is
the leaf.
 For example, if a data point is 42, the stem is 4 and the leaf is 2. When your data
have more digits, you’ll need a longer stem. For instance, 238 has a stem of 23
and a leaf of 8.
Stem and leaf plots
 Example: A teacher asked 23 of her male students how many books they had read in the last 4 years.
Draw a stem and leaf diagram for the following data.
32,55,12,18,23,26,24,28,44,53,27,16,14,34,29,31,41,17,11,32,54,48,45
 First, sort the numerical values in ascending order:
11,12,14,16,17,18,23,24,26,27,28,29,31,32,32,34,41,44,45,48,53,54,55
 Analyze the data and determine the stems and leaves: the minimum value is 11 and the maximum value is 55, so the stems run from 1 to 5.
 Draw a line and list the possible stems in a single column from lowest to highest on the left side of
the line.

 Then, for each stem, record all the leaves associated with that stem. Note that the
leaves are in numerical order.
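The steps above can be sketched in a few lines: sort the values, split each one into a stem (tens digit) and a leaf (units digit), and collect the leaves per stem. The function name is illustrative, and the split by 10 assumes two-digit values as in this example:

```python
from collections import defaultdict

books = [32, 55, 12, 18, 23, 26, 24, 28, 44, 53, 27, 16,
         14, 34, 29, 31, 41, 17, 11, 32, 54, 48, 45]

def stem_and_leaf(values):
    """Group the units digits (leaves) under their tens digits (stems)."""
    plot = defaultdict(list)
    for v in sorted(values):          # sorting keeps the leaves in order
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(books)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")

# Prints:
# 1 | 1 2 4 6 7 8
# 2 | 3 4 6 7 8 9
# 3 | 1 2 2 4
# 4 | 1 4 5 8
# 5 | 3 4 5
```

Read sideways, the rows show the same shape a histogram with class width 10 would show, while still preserving every individual value.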
Measures of Central Tendency
 In statistics, a measure of central tendency is a descriptive summary of a
data set.
 It reflects the centre of the data distribution through a single
value.
 It does not provide information about individual values
in the dataset; it gives a summary of the dataset as a whole. Generally,
the central tendency of a dataset can be described using measures
such as the mean and the median.
Mean
 The mean represents the average value of the dataset.
 It can be calculated as the sum of all the values in the dataset
divided by the number of values. In general, it is considered as the
arithmetic mean.
 Some other measures of mean used to find the central tendency
are as follows:
 Geometric Mean (nth root of the product of n numbers)
 Harmonic Mean (the reciprocal of the average of the reciprocals)
 Weighted Mean (where some values contribute more than others)
 If all the values in the dataset are the same,
then the geometric, arithmetic and harmonic means are all the
same. If there is variability in the data, then the three means
differ.
Arithmetic Mean
The arithmetic mean is the number obtained by dividing the
sum of the elements of a set by the number of values in the set. In everyday
language it is simply called the average. If a data set consists of the values
b1, b2, b3, …, bn, then the arithmetic mean B is defined as:
B = (Sum of all observations) / (Total number of observations)

The arithmetic mean of Virat Kohli’s batting scores, also called his batting
average, is:
Sum of runs scored / Number of innings = 661/10
The arithmetic mean of his scores in the last 10 innings is 66.1.
Geometric Mean
 The geometric mean is calculated as the N-th root of the product of all values, where N
is the number of values.
Geometric Mean = N-root(x1 * x2 * … * xN)

 For example, if the data contains only two values, the square root of the product of the
two values is the geometric mean. For three values, the cube-root is used, and so on.
 So, for the dataset 1, 3, 9, 27, 81, 243, 729, the geometric mean is:
1 * 3 * 9 * 27 * 81 * 243 * 729 = 10,460,353,203
7th root of 10,460,353,203 = 27
geometric mean = 27
 The geometric mean is appropriate when the data contains values with different units
of measure, e.g. some measure are height, some are dollars, some are miles, etc.
Average, Geometric & Harmonic Means in Data Analysis

1, 3, 9, 27, 81, 243, 729


(1 + 3 + 9 + 27 + 81 + 243 + 729) ÷ 7 = 156.1
156 isn’t particularly close to most of the numbers in the
dataset. In fact it’s more than 5x the median (middle
number), which is 27.

So what to do?
Use the geometric mean of the dataset instead:
1 * 3 * 9 * 27 * 81 * 243 * 729 = 10,460,353,203
7th root of 10,460,353,203 = 27
geometric mean = 27
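The comparison can be checked with Python's standard `statistics` module (Python 3.8 or later is assumed for `geometric_mean`). The harmonic mean is included only for completeness:

```python
import statistics

data = [1, 3, 9, 27, 81, 243, 729]

arithmetic = statistics.mean(data)           # sum of values / number of values
geometric = statistics.geometric_mean(data)  # 7th root of the product
harmonic = statistics.harmonic_mean(data)    # 7 / sum of reciprocals
median = statistics.median(data)             # middle value of the sorted data

print(f"arithmetic mean = {arithmetic:.1f}")
print(f"geometric mean  = {geometric:.1f}")
print(f"harmonic mean   = {harmonic:.2f}")
print(f"median          = {median}")
```

Because each value is three times the previous one, the geometric mean lands on the middle value, while the arithmetic mean is pulled upwards by the largest values.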
Example
For instance, we want to compare online ratings for two coffeeshops using two different sources.
The problem is that source 1 uses a 5-star scale & source 2 uses a 100-point scale:
Coffeeshop A
source 1 rating: 4.5
source 2 rating: 68
Coffeeshop B
source 1 rating: 3
source 2 rating: 75
 If we naively take the arithmetic mean of raw ratings for each coffeeshop:
 Coffeeshop A = (4.5 + 68) ÷ 2 = 36.25
 Coffeeshop B = (3 + 75) ÷ 2 = 39
We’d conclude that Coffeeshop B was the winner.
If we were a bit more number-savvy, we’d know that we have to normalize our values onto the same scale
before averaging them with the arithmetic mean, to get an accurate result.
So we multiply the source 1 ratings by 20 to bring them from a 5-star scale to the 100-point scale of
source 2:
Coffeeshop A
4.5 * 20 = 90
(90 + 68) ÷ 2 = 79
Coffeeshop B
3 * 20 = 60
(60 + 75) ÷ 2 = 67.5
So we find that Coffeeshop A is the true winner, contrary to the naive application of arithmetic mean
above.
The geometric mean, however, allows us to reach the same conclusion without having to fuss
over the scale or units of measure:
 Coffeeshop A = square root of (4.5 * 68) = 17.5
 Coffeeshop B = square root of (3 * 75) = 15
We’d conclude that Coffeeshop A was the winner.
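The three comparisons in the coffeeshop example can be reproduced as follows. This is a small sketch with illustrative names; the factor 20 is the rescaling from the 5-star scale to the 100-point scale described in the example:

```python
import math

# (source 1 rating on a 5-star scale, source 2 rating on a 100-point scale)
ratings = {"Coffeeshop A": (4.5, 68), "Coffeeshop B": (3, 75)}

# Naive arithmetic mean of the raw ratings.
naive = {name: (s1 + s2) / 2 for name, (s1, s2) in ratings.items()}

# Arithmetic mean after rescaling source 1 to the 100-point scale.
normalized = {name: (s1 * 20 + s2) / 2 for name, (s1, s2) in ratings.items()}

# Geometric mean of the raw ratings: square root of the product.
geometric = {name: math.sqrt(s1 * s2) for name, (s1, s2) in ratings.items()}

for label, scores in (("naive", naive), ("normalized", normalized),
                      ("geometric", geometric)):
    winner = max(scores, key=scores.get)
    print(f"{label:10} {scores}  winner: {winner}")
```

Rescaling one source multiplies both geometric means by the same factor, so the ranking does not change; that is why the geometric mean agrees with the normalized comparison here.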
Harmonic Mean
 The harmonic mean is calculated as the number of values N divided by the
sum of the reciprocals of the values (1 over each value).
Harmonic Mean = N / (1/x1 + 1/x2 + … + 1/xN)
 If there are just two values (x1 and x2), the harmonic mean simplifies to:
Harmonic Mean = (2 * x1 * x2) / (x1 + x2)
 For data involving rates and ratios (e.g. speed, acceleration, frequency), the
harmonic mean gives the most appropriate value of the mean.
For example,
if a vehicle travels a specified distance at speed x (e.g. 60 km/h) and then
travels the same distance again at speed y (e.g. 40 km/h),
the average speed is the harmonic mean of x and y:
(2 * 60 * 40) / (60 + 40) = 48 km/h.
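The speed example can be verified in two ways: with the harmonic mean formula, and by dividing total distance by total time. A minimal Python sketch follows; the leg length `leg_km` is a hypothetical value chosen only for the cross-check, since the result does not depend on it.

```python
import statistics

def harmonic_mean(values):
    # N divided by the sum of the reciprocals of the values
    return len(values) / sum(1 / v for v in values)

speeds = [60, 40]  # km/h, one speed per leg of equal distance

hm = harmonic_mean(speeds)
hm_stdlib = statistics.harmonic_mean(speeds)  # standard-library equivalent

# Cross-check: average speed = total distance / total time
leg_km = 120  # hypothetical leg length
total_time_h = sum(leg_km / s for s in speeds)
average_speed = (len(speeds) * leg_km) / total_time_h

print(hm, hm_stdlib, average_speed)
```

All three values agree at about 48 km/h, whereas the arithmetic mean of the two speeds would give 50 km/h and overstate the average.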
How to Choose the Correct Mean?
 There are three different ways of calculating the average or
mean of a variable or dataset.
 The arithmetic mean is the most commonly used mean, although
it may not be appropriate in some cases.
 Each mean is appropriate for different types of data
 If values have the same units: Use the arithmetic mean.
 If values have differing units: Use the geometric mean.
 If values are rates: Use the harmonic mean.
Mean for grouped data
Masses     40-49  50-59  60-69  70-79  80-89  90-99
Frequency  6      8      12     14     7      3

CI      f    m      fm
40-49   6    44.5   267
50-59   8    54.5   436
60-69   12   64.5   774
70-79   14   74.5   1043
80-89   7    84.5   591.5
90-99   3    94.5   283.5
Total   50          3395
Where m = midpoint of the class interval.
Mean = ∑fm / ∑f = 3395 / 50 = 67.9
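The grouped-mean table above can be reproduced as follows. This is a small Python sketch; the variable names are illustrative, and the midpoint m is taken as the average of each class's lower and upper limits, which matches the m column of the table.

```python
# Class intervals (lower limit, upper limit) and their frequencies, from the table
classes = [(40, 49), (50, 59), (60, 69), (70, 79), (80, 89), (90, 99)]
freqs = [6, 8, 12, 14, 7, 3]

midpoints = [(lo + hi) / 2 for lo, hi in classes]   # m column
fm = [f * m for f, m in zip(freqs, midpoints)]      # fm column

total_f = sum(freqs)       # sum of f
total_fm = sum(fm)         # sum of fm
grouped_mean = total_fm / total_f

for (lo, hi), f, m, prod in zip(classes, freqs, midpoints, fm):
    print(f"{lo}-{hi}  f={f}  m={m}  fm={prod}")
print(f"Total f={total_f}  fm={total_fm}  mean={grouped_mean:.1f}")
```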
Median
Masses     40-49  50-59  60-69  70-79  80-89  90-99
Frequency  6      8      12     14     7      3

Median = l + ([N/2 - m]/f) x c
Where l = Lower limit of the median class,
f = Frequency of the median class
c = Width of the median class,
N = The total frequency (∑f)
m = cumulative frequency of the class preceding the median class

CI      f    cf
40-49   6    6
50-59   8    14
60-69   12   26
70-79   14   40
80-89   7    47
90-99   3    50
Total   50

Median class = class containing the (N/2)th value
= (50/2)th value
= 25th value
 Median class = 60 - 69
 l = 60, N/2 = 25, m = 14, f = 12 and c = 10
Substitute
Median = 60 + ([25 - 14]/12) x 10
= 60 + (11/12) x 10
= 60 + 9.17
= 69.17
≈ 69
Median
Score      41-45  36-40  31-35  26-30  21-25  16-20
Frequency  1      8      8      14     7      2
<cf        40     39     31     23     9      2

Median = l + ([N/2 - m]/f) x c
Where l = Lower limit of the median class,
f = Frequency of the median class
c = Width of the median class,
N = The total frequency (∑f)
m = cumulative frequency of the class preceding the median class

Median class = class containing the (N/2)th value
= (40/2)th value
= 20th value
 Median class = 26 - 30
 l = 26, N/2 = 20, m = 9, f = 14 and c = 5
Substitute
Median = 26 + ([20 - 9]/14) x 5
= 26 + (11/14) x 5
= 26 + 3.93
= 29.93
≈ 30
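The two grouped-median calculations above follow the same steps, so they can be sketched as one function. The function name `grouped_median` and its arguments are illustrative assumptions. The sketch follows the slides' convention of using the lower class limit as l (some textbooks use the class boundary instead, e.g. 59.5 rather than 60), and it assumes at least two classes of equal width.

```python
def grouped_median(classes, freqs):
    """Median = l + ((N/2 - m) / f) * c, with l taken as the lower class limit."""
    # Sort classes in ascending order so cumulative frequencies run upward,
    # even when the table lists the highest class first
    rows = sorted(zip(classes, freqs))
    n = sum(f for _, f in rows)               # N: total frequency
    half = n / 2
    width = rows[1][0][0] - rows[0][0][0]     # c: gap between consecutive lower limits
    cumulative = 0                            # m: cumulative frequency before this class
    for (lower, _upper), f in rows:
        if cumulative + f >= half:            # this class contains the (N/2)th value
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("empty frequency distribution")

masses = [(40, 49), (50, 59), (60, 69), (70, 79), (80, 89), (90, 99)]
mass_freqs = [6, 8, 12, 14, 7, 3]

scores = [(41, 45), (36, 40), (31, 35), (26, 30), (21, 25), (16, 20)]
score_freqs = [1, 8, 8, 14, 7, 2]

mass_median = grouped_median(masses, mass_freqs)
score_median = grouped_median(scores, score_freqs)
print(f"masses: {mass_median:.2f}, scores: {score_median:.2f}")
```

Sorting first means the descending score table needs no special handling: the cumulative frequencies computed in the loop match the "<cf" row read from right to left.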
Mode
Score 41-45 36-40 31-35 26-30 21-25 16-20
Frequency 1 8 8 14 7 2
HW
Box Plot
Represent the following data by a Rectangle:

To construct a Rectangle, first we find percentages and cumulative percentages as given below:
Sub-divided Rectangles.
The following table gives the details of monthly budgets of two families:
Primary Data Vs Secondary Data
Primary Data
 Primary data is data that is collected for the first time
through personal experiences or evidence, particularly for
research.
 It is also described as raw data or first-hand information.
 Collecting primary data is costly.
 The data is mostly collected through observations,
physical testing, mailed questionnaires, surveys, personal
interviews, telephonic interviews, case studies, and focus
groups, etc.
Secondary Data
 Secondary data is second-hand data that has already been collected and recorded
by other researchers for their own purposes, and not for the current research
problem.
 It is accessible from sources such as
government publications, censuses, internal records of the organisation,
books, journal articles, websites and reports, etc.
 Gathering secondary data is affordable, and because the data is readily available
it saves cost and time.
 However, one disadvantage is that the information was assembled for
some other purpose, so it may not meet the present research purpose or may
not be accurate.
Career and Job Opportunities
It’s not enough to have an MBA degree from a top-notch B-School.
A quantitative analyst is the rocket scientist of the finance world.
 Quantitative analysts apply a blend of techniques and knowledge
from multiple disciplines, including finance, economics, mathematics, statistics
and computer science.
Quant analysts earn an average of ₹17 lakhs per year, mostly ranging from ₹14 lakhs
to ₹23 lakhs per year.
There is a high demand for quantitative analysts, but very low supply.
Can a quantitative analyst become a data scientist?
The answer to this is YES.
Certification: