Data Analysis 3rd Sem
Primary data is data that is collected for the first time, through personal experience or
evidence, particularly for research. It is also described as raw data or first-hand information.
Collecting it is costly, as the work is often done by an agency or an external organisation
and needs human resources and investment. The investigator supervises and controls the
data collection process directly.
The data is mostly collected through observations, physical testing, mailed
questionnaires, surveys, personal interviews, telephonic interviews, case studies, and focus
groups.
Books, magazines, and newspapers
Published articles of local bodies, and central and state governments
Statistical synopses, census records, and other reports issued by the different
departments of the government
Q. What are the sources of primary data?
Journals
Diaries
Correspondence / letters
Internal sources
When data is collected from the reports and records of the organisation itself, these are
known as internal sources.
For example, a company publishes its annual report on profit and loss, total sales,
loans, wages, etc.
External sources
When data is collected from sources outside the organisation, these are known as
external sources. For example, if a tour and travel company obtains information on
Karnataka tourism from Karnataka Transport Corporation, it would be known as an
external source of data.
Investigator
● One who conducts an investigation, i.e., a statistical enquiry, and seeks
information is known as an investigator.
● An investigator can be an individual person or an organisation.
A measure of dispersion indicates the scattering of data. It describes how much the items
differ from one another, giving a clearer view of their distribution. In other words, a
measure of dispersion gives us an idea of how far individual items vary from the central
value.
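The idea can be illustrated with a minimal Python sketch. The two series below are hypothetical and were chosen only because they share the same mean; the range and the population standard deviation are used here as two common measures of dispersion.

```python
import statistics

# Two hypothetical series with the same mean but different scatter.
series_a = [48, 49, 50, 51, 52]
series_b = [10, 30, 50, 70, 90]


def describe(values):
    """Return the mean, range, and population standard deviation."""
    return {
        "mean": statistics.mean(values),
        "range": max(values) - min(values),
        "std_dev": statistics.pstdev(values),
    }


summary_a = describe(series_a)
summary_b = describe(series_b)
print(summary_a)
print(summary_b)
```

Both series have the same central value, but the second is far more scattered, which only the dispersion measures reveal.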
Characteristics of a Good Measure of Dispersion
Line Graphs – A line graph (or linear graph) is used to display continuous
data, and it is useful for predicting future events over time.
Bar Graphs – A bar graph is used to display categorical data; it compares
the data using solid bars to represent the quantities.
Histograms – A graph that uses bars to represent the frequency of numerical
data organised into intervals. When all the intervals are equal and
continuous, all the bars have the same width.
Line Plot – It shows the frequency of data along a number line. An ‘x’ is placed
above the number line each time a value occurs.
Frequency Table – The table shows the number of pieces of data that fall
within each given interval.
Circle Graph – Also known as a pie chart, it shows the relationship of the
parts to the whole. The full circle represents 100%, and each category
occupies a sector in proportion to its percentage, such as 15% or 56%.
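A frequency table with equal intervals, and the rough shape of the corresponding histogram, can be sketched in a few lines of Python. The marks below are made-up illustrative data, and the interval width of 10 is an assumption chosen for the example.

```python
from collections import Counter

# Hypothetical marks of 12 students (illustrative data only).
marks = [12, 25, 37, 8, 19, 41, 33, 27, 15, 22, 38, 29]
width = 10  # every class interval has the same width

# Map each value to the lower limit of its interval: 0-9, 10-19, ...
counts = Counter((m // width) * width for m in marks)

frequency_table = [
    (f"{low}-{low + width - 1}", counts[low])
    for low in range(0, max(marks) + 1, width)
]

for interval, freq in frequency_table:
    # A row of 'x' marks gives a rough text histogram of the frequencies.
    print(f"{interval:>6} | {freq:>2} | {'x' * freq}")
```

Because every interval has the same width, each row of marks is directly comparable, just as equal-width bars are in a histogram.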
Some of the merits of using graphs are as follows:
Median:
The median is that value of the series which divides the group into two equal parts, one
part comprising all values greater than the median value and the other part comprising all
the values smaller than the median value.
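As a minimal sketch, the median of an ungrouped series can be found by sorting the values and picking the middle one. For an even number of items, the usual convention (assumed here) is to take the mean of the two middle items. The function name and the sample series are illustrative.

```python
import statistics


def median(values):
    """Return the middle value of the sorted series."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of items: a single middle value exists.
        return ordered[mid]
    # Even number of items: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_series = [4, 6, 9, 3, 7]
even_series = [4, 6, 9, 3, 7, 10]

print(median(odd_series))
print(median(even_series))
# The standard library gives the same answers.
print(statistics.median(odd_series), statistics.median(even_series))
```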
Merits of median
(1) Simplicity: - It is a very simple measure of the central tendency of the series. In the case of
a simple statistical series, just a glance at the data is enough to locate the median value.
(2) Free from the effect of extreme values: - Unlike the arithmetic mean, the median value is not
distorted by the extreme values of the series.
(3) Certainty: - Certainty is another merit of the median. The median is always a
certain, specific value in the series.
(4) Real value: - The median is a real value and is a better representative value of the
series than the arithmetic mean, the value of which may not exist in the series
at all.
(5) Graphic presentation: - Besides algebraic approach, the median value can be
estimated also through the graphic presentation of data.
(6) Possible even when data is incomplete: - Median can be estimated even in the case of
certain incomplete series. It is enough if one knows the number of items and the middle
item of the series.
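Merit (2) can be checked with a small Python sketch. The wage figures are hypothetical; the second series simply replaces the largest wage with an extreme value.

```python
import statistics

# Hypothetical daily wages (illustrative data only).
wages = [200, 220, 240, 260, 280]
# Same series, but the largest wage is replaced by an extreme value.
wages_with_extreme = wages[:-1] + [2800]

mean_before = statistics.mean(wages)
mean_after = statistics.mean(wages_with_extreme)
median_before = statistics.median(wages)
median_after = statistics.median(wages_with_extreme)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

The mean is pulled sharply towards the extreme value, while the median stays where it was.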
Demerits of median:
Following are the various demerits of median:
Mode:
The value of the variable which occurs most frequently in a distribution is called the mode.
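A minimal Python sketch of this definition counts how often each value occurs and returns the value or values with the highest frequency. The function name and the sample data are illustrative; returning a list is a design choice made here so that ties are visible rather than hidden.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


shoe_sizes = [6, 7, 7, 8, 8, 8, 9]
tied_series = [1, 1, 2, 2, 3]

print(modes(shoe_sizes))   # one value is most frequent
print(modes(tied_series))  # two values share the highest frequency
```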
Merits of mode:
(1) Simple and popular: - Mode is a very simple measure of central tendency.
Sometimes, just a glance at the series is enough to locate the modal value. Because of
its simplicity, it is a very popular measure of the central tendency.
(2) Less effect of marginal values: - Compared to the mean, the mode is less
affected by marginal values in the series. The mode is determined only by the
value with the highest frequency.
(3) Graphic presentation: - Mode can be located graphically, with the help of
a histogram.
(4) Best representative: - Mode is that value which occurs most frequently
in the series. Accordingly, mode is the best representative value of the series.
(5) No need of knowing all the items or frequencies: - The calculation of
mode does not require knowledge of all the items and frequencies of a
distribution. In simple series, it is enough if one knows the items with highest
frequencies in the distribution.
Demerits of mode:
(1) Uncertain and vague: - Mode is an uncertain and vague measure of the
central tendency.
(2) Not capable of algebraic treatment: - Unlike mean, mode is not capable
of further algebraic treatment.
(3) Difficult: - When the frequencies of all items are identical, it is difficult to
identify the modal value.
(4) Complex procedure of grouping: - Calculation of the mode involves a
cumbersome procedure of grouping the data. If the extent of grouping
changes, there will be a change in the modal value.
Q. Define correlation
Correlation is a statistical measure that indicates the extent to which
two or more variables fluctuate in relation to each other. A positive
correlation indicates the extent to which those variables increase or decrease
in parallel; a negative correlation indicates the extent to which one variable
increases as the other decreases.
[Figure: Scatter diagram with strong positive correlation]
Government publications
Public records
Historical and statistical documents
Business documents
Technical and trade journals
Unpublished data includes
Diaries
Letters
Unpublished biographies, etc.
Coefficient of Correlation : A coefficient of correlation is generally applied in statistics to
measure the relationship between two variables. It gives a specific value for the degree of
linear relationship between two variables, say X and Y. There are
various types of correlation coefficients. However, Pearson’s correlation (also known as
Pearson’s r) is the correlation coefficient most frequently used in linear regression.
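A minimal Python sketch of Pearson's coefficient is given below, written in its deviation form: the sum of products of deviations from the means, divided by the square root of the product of the sums of squared deviations. The function name and the study-hours data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the two means."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (at least 2 items)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a series is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: hours studied and test scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

r_positive = pearson_r(hours, scores)
r_negative = pearson_r(hours, scores[::-1])  # scores fall as hours rise
print(round(r_positive, 4), round(r_negative, 4))
```

Reversing one series turns a strong positive coefficient into an equally strong negative one.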
2. Causal Relationship: Correlation does not require a direct cause-and-effect relationship between
the two variables under study. However, it assumes a cause-and-effect relationship between the
forces affecting the two variables. Correlation is meaningless if there is no such relationship.
3. Normal Distribution: The two variables under study are affected by a large number of
independent causes of such a nature as to produce a normal distribution. Variables such as height,
weight, the color of skin, etc. are affected by a multiplicity of forces.
4. Error of Measurement: The coefficient of correlation is more reliable if the error of measurement
is reduced to the minimum.
5. Both variables are on an interval or ratio level of measurement
6. Data from both variables follow normal distributions
7. Your data have no outliers
8. Your data is from a random or representative sample
9. You expect a linear relationship between the two variables
Q. Write a short note on regression coefficient
Range:
The difference between the lowest and highest values.
In {4, 6, 9, 3, 7} the lowest value is 3, and the highest is 9, so the range is 9 − 3 = 6.
Range can also mean all the output values of a function.
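Both meanings of range can be shown in a short Python sketch. The first part repeats the worked example above; the function f and its domain in the second part are assumptions made purely for illustration.

```python
# Statistical range: highest value minus lowest value.
values = [4, 6, 9, 3, 7]
lowest, highest = min(values), max(values)
value_range = highest - lowest
print(f"{highest} - {lowest} = {value_range}")


# Range of a function: the set of all its output values.
def f(x):
    return x * x


domain = [-2, -1, 0, 1, 2]
function_range = sorted({f(x) for x in domain})
print(function_range)
```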
1) The correlation coefficient is computed from paired values of the two variables, but it
does not take on their units of measurement.
2) The sign of the correlation coefficient is always the same as the sign of the covariance of the two variables.
3) The numerical value of the correlation coefficient always lies between -1 and +1. It is
a real number.
4) A negative value of the coefficient indicates a negative relationship: one variable tends
to decrease as the other increases. The closer ‘r’ is to -1, the stronger the negative
relationship.
When ‘r’ approaches +1, the relationship is strong and positive.
If ‘r’ is exactly +1, the relationship is perfectly positive.
5) A weak correlation is signalled when the coefficient of correlation approaches zero.
When ‘r’ is close to zero, we can deduce that the linear relationship is weak.
6) The correlation coefficient can be misleading, because we cannot always tell whether the
participants were truthful.
The coefficient of correlation is not affected when we interchange the two variables.
7) The coefficient of correlation is a pure number, free of units. It is not affected
when we add the same number to all the values of one variable, or when we multiply
all the values of one variable by the same positive number.
In other words, ‘r’ is invariant to changes of origin and scale.
8) We use correlation for measuring association, but that does not mean we are talking
about causation. When two variables are correlated, it is possible that a third
variable is influencing both of them.
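Several of these properties can be checked numerically. The sketch below is a minimal illustration on made-up data; pearson_r is a hypothetical helper using the deviation form of Pearson's coefficient.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the two means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up paired observations (illustrative data only).
x = [2, 4, 6, 8, 10]
y = [1, 3, 2, 5, 4]

r = pearson_r(x, y)
r_swapped = pearson_r(y, x)                     # interchange the variables
r_shifted = pearson_r([v + 100 for v in x], y)  # add a constant to x
r_scaled = pearson_r([v * 2.5 for v in x], y)   # multiply x by a positive number
r_flipped = pearson_r([v * -1 for v in x], y)   # multiply x by a negative number

print(r, r_swapped, r_shifted, r_scaled, r_flipped)
```

Interchanging the variables, shifting the origin, and rescaling by a positive number all leave r unchanged. Multiplying by a negative number reverses its sign, which is why the property is stated for a positive multiplier.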
Example 2: Calculate the Correlation coefficient of given data:
x 12 15 18 21 27
y 2 4 6 8 12
Solution:
Here n = 5
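x    12   15   18   21   27
y    2    4    6    8    12
xy   24   60   108  168  324
x²   144  225  324  441  729
y²   4    16   36   64   144
∑x = 93
∑y = 32
∑xy = 684
∑x² = 1863
∑y² = 264
Now, substitute all the values in the formula
r = (n∑xy − ∑x∑y) / √{[n∑x² − (∑x)²][n∑y² − (∑y)²]}
r = (5 × 684 − 93 × 32) / √{[5 × 1863 − 93²][5 × 264 − 32²]}
  = 444 / √(666 × 296) = 444 / 444
We have r = 1, because every point lies exactly on the straight line y = (2/3)x − 6.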
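The arithmetic in this example can be checked with a short Python sketch. It assumes the formula referred to is the standard computational form of Pearson's coefficient, r = (n∑xy − ∑x∑y) / √{[n∑x² − (∑x)²][n∑y² − (∑y)²]}, and prints each intermediate sum so it can be compared with the table.

```python
import math

# Data from the example.
x = [12, 15, 18, 21, 27]
y = [2, 4, 6, 8, 12]
n = len(x)

sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

# Computational form of Pearson's r (assumed to be the intended formula).
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
r = numerator / denominator

print("products xy:", [a * b for a, b in zip(x, y)])
print("sums:", sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print("r =", r)
```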