Data Analysis 3rd Sem

The document defines primary and secondary data, highlighting that primary data is first-hand information collected for research, while secondary data is pre-existing information gathered for other purposes. It discusses various sources of both data types, methods of data collection, and measures of central tendency and dispersion, including mean, median, and mode. Additionally, it covers the importance of graphical representation in data analysis and the characteristics of effective questionnaires.

Primary Data Definition

Primary data is the data that is collected for the first time through personal experiences or
evidence, particularly for research. It is also described as raw data or first-hand information.
The mode of assembling the information is costly, as the analysis is done by an agency or
an external organisation, and needs human resources and investment. The investigator
supervises and controls the data collection process directly.
The data is mostly collected through observations, physical testing, mailed
questionnaires, surveys, personal interviews, telephonic interviews, case studies, and focus
groups, etc.

Secondary Data Definition


Secondary data is second-hand data that has already been collected and recorded by other
researchers for their own purposes, and not for the current research problem. It is accessible in
the form of data collected from different sources such as government publications,
censuses, internal records of the organisation, books, journal articles, websites and reports,
etc.
Gathering secondary data is affordable and quick, because the data is readily available.
The main disadvantage is that the information was assembled for some other purpose, so it
may not meet the present research purpose or may not be accurate.

Q. What are the sources of secondary data?

● Books, magazines, and newspapers
● Published articles of local bodies, and of central and state governments
● Statistical synopses, census records, and other reports issued by the different
departments of the government

Q. What are the sources of primary data?

● Journals
● Diaries
● Correspondence / letters

What are the different sources of data?


The following are the two sources of data:

Internal sources

When data is collected from the reports and records of the organisation itself, the
organisation is known as an internal source.

For example, a company publishes its annual report on profit and loss, total sales,
loans, wages, etc.

External sources

When data is collected from outside the organisation, those providers are known as
external sources. For example, if a tour and travel company obtains information on
Karnataka tourism from Karnataka Transport Corporation, the corporation is an
external source of data.

Investigator
● One who conducts an investigation, i.e., a statistical enquiry, and seeks
information is known as an investigator.
● It can be an individual person or an organisation.

Enumerator
● An enumerator is a person who helps investigators in the collection of data.

Informant
● An informant is the respondent who supplies the information to the
investigators or enumerators.

Q. What do you mean by measures of dispersion?

A measure of dispersion indicates the scattering of data. It describes how far the values
differ from one another, giving a clearer view of their distribution. A measure of
dispersion gives us an idea of how much the individual items vary around the
central value.
Characteristics of a Good Measure of Dispersion

● It should be easy to calculate and simple to understand.
● It should be based on all the observations of the series.
● It should be rigidly defined.
● It should not be affected by extreme values.
● It should not be unduly affected by sampling fluctuations.
● It should be capable of further mathematical treatment and statistical analysis.

Q. What is the meaning of an Excel function?

A function is a predefined formula in Excel that performs a specific mathematical,
statistical or logical calculation, using the values that the user supplies as arguments.

Graphical Representation is a way of analysing numerical data. It exhibits the relation
between data, ideas, information and concepts in a diagram. It is easy to understand and it
is one of the most important learning strategies. The right choice always depends on the type of
information in a particular domain. There are different types of graphical representation.

● Line Graphs – A line graph (or linear graph) displays continuous data. It shows
trends over time, which makes it useful for predicting future values.
● Bar Graphs – A bar graph displays categorical data and compares the quantities
using solid bars.
● Histograms – A histogram uses bars to represent the frequency of numerical
data organised into intervals. Since all the intervals are equal and
continuous, all the bars have the same width.
● Line Plot – A line plot shows the frequency of data along a number line. An ‘x’ is
placed above the number line each time that value occurs.
● Frequency Table – A frequency table shows the number of pieces of data that fall
within each given interval.
● Circle Graph – Also known as a pie chart, it shows the relationship of the
parts to the whole. The full circle represents 100%, and each category
occupies a slice in proportion to its percentage, such as 15% or 56%.
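
The frequency table and the line plot described above can be sketched in a few lines of Python. This is only an illustration: the marks and the interval width of 20 are invented for the example and do not come from these notes.

```python
from collections import Counter

# Hypothetical exam marks, invented purely for illustration.
marks = [12, 18, 25, 31, 37, 38, 42, 45, 47, 53, 58, 64]
width = 20  # every interval has the same width, as a histogram requires

# Map each mark to the lower bound of its interval: 0, 20, 40, 60, ...
table = Counter((mark // width) * width for mark in marks)

# Print one row per interval, with an 'x' for every observation in it.
for lower in sorted(table):
    upper = lower + width - 1
    print(f"{lower:>2}-{upper:<2} | {'x' * table[lower]} ({table[lower]})")
```

Each printed row is one line of the frequency table, and the run of 'x' marks is a sideways line plot of the same counts.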

Some of the merits of using graphs are as follows:

● A graph is easily understood by everyone without any prior knowledge.
● It saves time.
● It allows us to relate and compare the data for different time periods.
● It is used in statistics to determine the mean, median and mode for different data,
as well as in the interpolation and the extrapolation of data.

Advantages of Graphical Representation of Data

1. Acceptability: Such a report is acceptable to busy persons because it easily highlights
the theme of the report. This helps to avoid waste of time.
2. Comparative Analysis: Information can be compared by means of graphical
representation. Such comparative analysis helps quick understanding and draws attention.
3. Less cost: Descriptive information takes a long time to present properly and costs
more to print, whereas a graphical presentation can convey the same report in a short
but catchy view. It therefore involves less cost.
4. Decision Making: Business executives can view the graphs at a glance and can make a
decision very quickly, which is hardly possible through descriptive reports.
5. Logical Ideas: If tables, designs, and graphs are used to represent information, a
logical sequence is created that makes the idea clear to the audience.
6. Helpful for less literate Audience: Less literate or illiterate people can understand
graphical representation easily because it does not involve going line by line through
descriptive reports.
7. Less Effort and Time: Presenting a table, design, image, or graph requires less effort
and time. Furthermore, such a presentation allows a quick understanding of the
information.
8. Less Error and Mistakes: Qualitative, informative or descriptive reports are prone to
errors and mistakes. As graphical representations are exhibited through numerical figures,
tables, or graphs, they usually involve fewer errors and mistakes.
9. A complete Idea: Such representation creates a clear and complete idea in the mind of
the audience. Reading a hundred pages may not give any scope to make a decision. But an
instant view or looking at a glance obviously makes an impression in the mind of the
audience regarding the topic or subject.
10. Use in the Notice Board: Such representation can be hung on the notice board to
quickly raise the attention of employees in any organization.

Q. What are the points to be followed before drafting a good questionnaire or
schedule?

1. Limited number of questions: The number of questions should be limited as
far as possible. Normally 15 to 20 questions are sufficient for making the
required enquiry.
2. Simplicity: The language of the question should be simple and easily
understandable. It should be clear and not be vague. It should not convey two
meanings.
3. Logical arrangement: The questions should be arranged logically. There should
be a proper sequence of the questions.
4. Related to the point: Questions should be related to the point. They should
not be irrelevant.
5. Avoiding personal questions: Personal questions should be avoided as far as
possible. For example, questions about income, volume of sales, etc. should not
be asked.

MEAN
The arithmetic mean (or simply "mean") of a sample is the sum of the sampled values
divided by the number of items in the sample.
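
A minimal sketch of this definition in Python. The sample values are illustrative (they reuse the small set from the range example in these notes), and the extra value 100 is an invented outlier added to show how strongly one extreme observation moves the mean.

```python
from statistics import mean

sample = [4, 6, 9, 3, 7]          # illustrative values only

# Sum of the sampled values divided by the number of items.
average = sum(sample) / len(sample)
assert average == mean(sample)    # the standard-library function agrees

# One extreme observation pulls the mean sharply upward.
with_outlier = sample + [100]
average_with_outlier = sum(with_outlier) / len(with_outlier)

print(average)               # 5.8
print(average_with_outlier)  # 21.5
```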

MERITS OF ARITHMETIC MEAN

● The arithmetic mean is rigidly defined by an algebraic formula.
● It is easy to calculate and simple to understand.
● It is based on all observations and can be regarded as representative of the
given data.
● It is capable of being treated mathematically and hence it is widely used in statistical
analysis.
● The arithmetic mean can be computed even if the detailed distribution is not known, provided
the sum of the observations and the number of observations are known.
● It is least affected by fluctuations of sampling.

DEMERITS OF ARITHMETIC MEAN

● It cannot be determined by inspection or by graphical location.
● The arithmetic mean cannot be computed for qualitative data such as intelligence, honesty
and smoking habit.
● It is strongly affected by extreme observations and hence does not adequately represent data
containing some extreme values.
● The arithmetic mean cannot be computed when class intervals have open ends.

Median:
The median is that value of the series which divides the group into two equal parts, one
part comprising all values greater than the median value and the other part comprising all
the values smaller than the median value.
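
A short sketch of how the median is located, assuming the usual convention of averaging the two middle items when the series has an even number of values. The function name `middle_value` and the data are illustrative, not part of the original notes.

```python
from statistics import median

def middle_value(values):
    """Return the middle item of the sorted series, or the average of
    the two middle items when the number of items is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_series = [4, 6, 9, 3, 7]
even_series = [4, 6, 9, 3, 7, 100]   # same data plus one extreme value

print(middle_value(odd_series))    # 6
print(middle_value(even_series))   # 6.5
```

Adding the extreme value 100 moves the median only from 6 to 6.5. In the even case the result lies between two middle items and is not itself a value of the series.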

Merits of median

(1) Simplicity: - It is a very simple measure of the central tendency of the series. In the case of a
simple statistical series, just a glance at the data is enough to locate the median value.
(2) Free from the effect of extreme values: - Unlike the arithmetic mean, the median value is not
distorted by the extreme values of the series.
(3) Certainty: - Certainty is another merit of the median. The median is always a
certain, specific value in the series.
(4) Real value: - The median is a real value and is a better representative value of the
series than the arithmetic mean, the value of which may not exist in the series
at all.
(5) Graphic presentation: - Besides the algebraic approach, the median value can also be
estimated through the graphic presentation of data.
(6) Possible even when data is incomplete: - The median can be estimated even in the case of
certain incomplete series. It is enough if one knows the number of items and the middle
item of the series.

Demerits of median:
Following are the various demerits of median:

(1) Lack of representative character: - The median fails to be a representative
measure for series whose values are wide apart from
each other. Also, the median is of limited representative character as it is not based
on all the items in the series.
(2) Unrealistic: - When the median is located somewhere between the two
middle values, it remains only an approximate measure, not a precise value.
(3) Lack of algebraic treatment: - The arithmetic mean is capable of further
algebraic treatment, but the median is not. For example, multiplying the median
by the number of items in the series will not give us the sum total of the values
of the series.

Mode:

The value of the variable which occurs most frequently in a distribution is called the mode.
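
A minimal sketch of finding the mode by counting frequencies. The helper name `modes` and the three small series are made up for illustration. It returns every value that ties for the highest frequency.

```python
from collections import Counter

def modes(values):
    """Return every value that shares the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

print(modes([2, 3, 3, 5, 3, 7]))   # [3]
print(modes([1, 1, 2, 2, 3]))      # [1, 2]
print(modes([4, 5, 6]))            # [4, 5, 6]
```

The second and third series have no single most frequent value, so the mode is not uniquely determined there.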

Merits of mode:

(1) Simple and popular: - The mode is a very simple measure of central tendency.
Sometimes, just a glance at the series is enough to locate the modal value. Because of
its simplicity, it is a very popular measure of central tendency.
(2) Less effect of marginal values: - Compared to the mean, the mode is less
affected by marginal values in the series. The mode is determined only by the
value with the highest frequency.
(3) Graphic presentation: - The mode can be located graphically, with the help of
a histogram.
(4) Best representative: - The mode is the value which occurs most frequently
in the series. Accordingly, the mode is the best representative value of the series.
(5) No need of knowing all the items or frequencies: - The calculation of
the mode does not require knowledge of all the items and frequencies of a
distribution. In a simple series, it is enough if one knows the items with the highest
frequencies in the distribution.

Demerits of mode:

(1) Uncertain and vague: - Mode is an uncertain and vague measure of the
central tendency.
(2) Not capable of algebraic treatment: - Unlike mean, mode is not capable
of further algebraic treatment.
(3) Difficult: - When the frequencies of all items are identical, it is difficult to
identify the modal value.
(4) Complex procedure of grouping: - Calculating the mode involves a
cumbersome procedure of grouping the data. If the extent of grouping
changes, the modal value changes too.
(5) Ignores extreme marginal frequencies: - It ignores extreme marginal
frequencies. To that extent the modal value is not a representative value of all
the items in a series. Besides, one can question the representative character
of the modal value, as its calculation does not involve all items of the series.

Q. Which measure of dispersion is the best and how?

Standard deviation is considered to be the best measure of dispersion and is
therefore the most widely used measure of dispersion.
(i) It is based on all values and thus provides information about the complete
series. Because of this reason, a change in even one value affects the value
of the standard deviation.
(ii) It is independent of origin but not of scale.
(iii) It is useful in advanced statistical calculation like the comparison of
variability in two data sets.
(iv) It can be used in testing of hypothesis.
(v) It is capable of further algebraic treatment.

1. Most of the statistical theory is based on Standard Deviation. It helps to
make comparison between variability of two or more sets of data. Also,
Standard Deviation helps in testing the significance of random samples and
in regression and correlation analysis.
2. It is based on the values of all the observations. In other words,
Standard Deviation makes use of every item in a particular distribution.
3. Standard Deviation has a precise value and is a well-defined and definite
measure of dispersion. That is, it is rigidly defined.
4. It is independent of the origin.
5. It is capable of further algebraic treatment.
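
The claim that standard deviation is independent of origin but not of scale can be checked numerically. The sketch below assumes the population form (division by n), since these notes do not say which divisor is intended. The data set and the constants 10 and 3 are illustrative.

```python
from statistics import pstdev

def population_sd(values):
    """Square root of the mean squared deviation from the arithmetic mean."""
    n = len(values)
    centre = sum(values) / n
    return (sum((v - centre) ** 2 for v in values) / n) ** 0.5

data = [4, 6, 9, 3, 7]
shifted = [v + 10 for v in data]   # change of origin: add a constant
scaled = [v * 3 for v in data]     # change of scale: multiply by a constant

sd = population_sd(data)
print(round(sd, 4))
print(round(population_sd(shifted), 4))   # same as the first line
print(round(population_sd(scaled), 4))    # three times the first line
```

Every value enters the sum of squared deviations, so changing even one observation changes the result.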

Q. Define correlation
Correlation is a statistical measure that indicates the extent to which
two or more variables fluctuate in relation to each other. A positive
correlation indicates the extent to which those variables increase or decrease
in parallel; a negative correlation indicates the extent to which one variable
increases as the other decreases.
Scatter Diagram with Strong Positive Correlation

This diagram is known as a “Scatter Diagram with Positive Slant.”


In a positive slant, the correlation is positive, i.e., as the value of X increases, the value of Y
will increase. You can say that the slope of a straight line drawn along the data points will
go up. The pattern resembles a straight line.
For example, cold drink sales will increase if the weather gets hotter.

Scatter Diagram with Strong Negative Correlation

This diagram is known as a “Scatter Diagram with a Negative Slant.”


In the negative slant, the correlation is negative, i.e., as the value of X increases, the value of
Y will decrease. The slope of a straight line drawn along the data points will go down.
For example, if the temperature increases, the sale of winter coats goes down.
Scatter Diagram with No Correlation
This diagram is known as the “Scatter Diagram with Zero Degree of Correlation.”
Here, the data point spread is so random that you cannot draw a line through them.
Therefore, you can conclude that these variables do not correlate.

Primary Data Collection Methods:


Primary data or raw data is a type of information that is obtained directly from the first-
hand source through experiments, surveys or observations. The primary data collection
method is further classified into two types:

● Quantitative Data Collection Methods
● Qualitative Data Collection Methods

Quantitative Data Collection Methods


It is based on mathematical calculations using various formats like close-ended
questions, correlation and regression methods, mean, median or mode measures. This
method is cheaper than qualitative data collection methods and it can be applied in a short
duration of time.

Qualitative Data Collection Methods


It does not involve any mathematical calculations. This method is closely associated with
elements that are not quantifiable. This qualitative data collection method includes
interviews, questionnaires, observations, case studies, etc. There are several methods to
collect this type of data. They are
Observation Method
Observation method is used when the study relates to behavioural science. This method is
planned systematically. It is subject to many controls and checks. The different types of
observations are:

● Structured and unstructured observation
● Controlled and uncontrolled observation
● Participant, non-participant and disguised observation

Interview Method
In this method, data is collected in the form of verbal responses. It is carried out in
two ways:

● Personal Interview – In this method, a person known as an interviewer
asks questions face to face to the other person. The personal interview can be
structured or unstructured, and may take the form of a direct investigation, a focused
conversation, etc.
● Telephonic Interview – In this method, an interviewer obtains information by
contacting people on the telephone and asking for their answers or views verbally.

Questionnaire Method
In this method, a set of questions is mailed to the respondent, who should read it, reply
and subsequently return the questionnaire. The questions are printed in a definite order
on the form. A good survey should have the following features:

● Short and simple
● Should follow a logical sequence
● Provide adequate space for answers
● Avoid technical terms
● Should have good physical appearance, such as colour and quality of the paper, to attract
the attention of the respondent

Schedules
This method is similar to the questionnaire method, with a slight difference:
enumerators are specially appointed for the purpose of filling in the schedules. The enumerator
explains the aims and objects of the investigation and may remove any misunderstandings that
have come up. Enumerators should be trained to perform their job with hard work and patience.

Secondary Data Collection Methods


Secondary data is data collected by someone other than the actual user. It means that
the information is already available and only has to be analysed. Sources of secondary
data include magazines, newspapers, books, journals, etc. The data may be either published
or unpublished.
Published data are available in various resources including

● Government publications
● Public records
● Historical and statistical documents
● Business documents
● Technical and trade journals

Unpublished data includes

● Diaries
● Letters
● Unpublished biographies, etc.

Coefficient of Correlation: A coefficient of correlation is generally applied in statistics to
measure the relationship between two variables. It expresses the degree of linear
relationship between two variables, say X and Y, as a single value. There are
various types of correlation coefficients. However, Pearson’s correlation (also known as
Pearson’s r) is the correlation coefficient most frequently used in linear regression.

Pearson’s Coefficient of Correlation: Karl Pearson’s coefficient of correlation is an
extensively used mathematical method in which a numerical value is used to
measure the strength of the relationship between linearly related variables. The coefficient of
correlation is expressed by “r”.

Karl Pearson Correlation Coefficient Formula

r = (n∑xy − ∑x∑y) / √([n∑x² − (∑x)²][n∑y² − (∑y)²])

where n is the number of paired observations.

Q. What are the assumptions of the Pearson correlation coefficient?


1. Linear Relationship: In this method, it is assumed that there is a linear relationship between the
variables, that is, if the paired observations of both the variables are plotted on the scatter
diagrams, the plotted points will form a straight line.

2. Causal Relationship: There is no cause and effect relationship between the two variables under
study. However, there exists a cause-and-effect relationship between the forces affecting the two
variables. Correlation is meaningless if there is no such relationship.

3. Normal Distribution: The two variables under study are affected by a large number of
independent causes of such a nature as to produce a normal distribution. Variables such as height,
weight, the color of skin, etc. are affected by a multiplicity of forces.
4. Error of Measurement: The coefficient of correlation is more reliable if the error of measurement
is reduced to the minimum.
5. Both variables are measured on an interval or ratio scale.
6. Data from both variables follow normal distributions.
7. The data have no outliers.
8. The data come from a random or representative sample.
9. A linear relationship is expected between the two variables.

Q. Write a short note on regression coefficient

Regression coefficient is a statistical measure of the average functional relationship
between two or more variables. In regression analysis, one variable is considered as
dependent and other(s) as independent. Thus, it measures the degree of
dependence of one variable on the other(s). Regression coefficient was first used for
estimating the relationship between the heights of fathers and their sons.

Properties of Regression Coefficient:


1. It is denoted by b.
2. It is expressed in terms of original unit of data.
3. Between two variables (say x and y), two values of regression coefficient can be
obtained. One will be obtained when we consider x as independent and y as dependent
and the other when we consider y as independent and x as dependent. The regression
coefficient of y on x is represented as byx and that of x on y as bxy.
4. Both regression coefficients must have the same sign. If byx is positive, bxy will also
be positive and vice versa.
5. If one regression coefficient is greater than unity, then the other regression
coefficient must be less than unity.
6. The geometric mean of the two regression coefficients is equal to the coefficient of
correlation: r = ±√(byx × bxy), where r takes the common sign of byx and bxy.
7. The arithmetic mean of both regression coefficients is equal to or greater than the coefficient
of correlation:
(byx + bxy)/2 ≥ r
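
These properties can be checked on a small data set. The sketch below assumes the usual raw-sum formulas byx = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²) and bxy = (n∑xy − ∑x∑y) / (n∑y² − (∑y)²), which these notes do not state explicitly. The x and y values are hypothetical.

```python
from math import sqrt

def regression_coefficients(x, y):
    """Return (byx, bxy) computed from raw sums of the paired data."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    cross = n * sum_xy - sum_x * sum_y
    byx = cross / (n * sum_x2 - sum_x ** 2)   # regression of y on x
    bxy = cross / (n * sum_y2 - sum_y ** 2)   # regression of x on y
    return byx, bxy

x = [1, 2, 3, 4, 5]   # hypothetical values, for illustration only
y = [2, 4, 5, 4, 5]

byx, bxy = regression_coefficients(x, y)
sign = 1 if byx >= 0 else -1
r = sign * sqrt(byx * bxy)        # property 6: geometric mean, with sign

print(byx, bxy)                   # 0.6 1.0
print(round(r, 4))                # 0.7746
print((byx + bxy) / 2 >= r)       # True (property 7)
```

Both coefficients share one numerator, which is why they always carry the same sign (property 4).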

Regression coefficients are classified as:


(1) Simple, partial and multiple

(2) Positive and negative and

(3) Linear and non-linear.

Range:
The difference between the lowest and highest values.
In {4, 6, 9, 3, 7} the lowest value is 3, and the highest is 9, so the range is 9 − 3 = 6.
Range can also mean all the output values of a function.

Q. Mention two objective measures of dispersion


Range, interquartile range, and standard deviation are the three commonly used
measures of dispersion.
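
All three measures can be computed with Python's standard library. This is a sketch on the small set from the range example. Note that textbooks use several conventions for locating quartiles, so the interquartile range from `statistics.quantiles` may differ slightly from a hand calculation, and `pstdev` assumes the population form of standard deviation.

```python
from statistics import pstdev, quantiles

data = [4, 6, 9, 3, 7]   # the set used in the range example

# Range: highest value minus lowest value.
data_range = max(data) - min(data)

# Interquartile range: third quartile minus first quartile.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Standard deviation (population form, dividing by n).
sd = pstdev(data)

print(data_range)   # 6
print(iqr)
print(sd)
```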

Q. Write the properties of correlation coefficient

1) The correlation coefficient does not depend on the units in which the two variables
are measured.
2) The sign of the correlation coefficient is always the same as the sign of the covariance.
3) The numerical value of the correlation coefficient lies between -1 and +1. It is a
real number.
4) A negative value of the coefficient indicates a negative correlation.
As ‘r’ approaches -1, the relationship becomes strong and negative.
As ‘r’ approaches +1, the relationship becomes strong and positive.
If r = +1, the relationship is perfectly positive.
5) A weak correlation is signalled when the coefficient of correlation approaches zero.
When ‘r’ is close to zero, we can deduce that the linear relationship is weak.
6) The correlation coefficient can be misleading when the data come from participants,
because we cannot tell whether the participants were truthful.
The coefficient of correlation is not affected when we interchange the two variables.
7) The coefficient of correlation is a pure number, free of units. It is not
affected when we add the same number to all the values of one variable, or when we multiply
all the values of a variable by the same positive number.
In other words, ‘r’ is invariant to changes of origin and scale.
8) Correlation measures association, which does not imply causation. When two variables
are correlated, it is possible that a third variable is influencing both of them.

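
Several of these properties can be verified numerically. The sketch below computes r as the covariance divided by the product of the standard deviations, which is a standard equivalent form of Pearson's formula. The helper name `pearson_r`, the data, and the constants 100, 7 and -1 are illustrative choices.

```python
from math import sqrt

def pearson_r(x, y):
    """Sum of cross-deviations divided by the root of the product of
    the sums of squared deviations (the common factor n cancels)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

x = [1, 2, 3, 4, 5]   # hypothetical values, for illustration only
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
checks = {
    "original": r,
    "interchange x and y": pearson_r(y, x),
    "add 100 to every x": pearson_r([a + 100 for a in x], y),
    "multiply every x by 7": pearson_r([a * 7 for a in x], y),
    "multiply every x by -1": pearson_r([-a for a in x], y),
}
for label, value in checks.items():
    print(f"{label}: {value:+.4f}")
```

Only the last line changes: multiplying by a negative number flips the sign of r but leaves its size unchanged, which is why property 7 is stated for positive multipliers.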
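Example 2: Calculate the correlation coefficient of the given data:

x     12    15    18    21    27
y      2     4     6     8    12

Solution:
Here n = 5

x     12    15    18    21    27
y      2     4     6     8    12
xy    24    60   108   168   324
x²   144   225   324   441   729
y²     4    16    36    64   144

∑x = 93
∑y = 32
∑xy = 684
∑x² = 1863
∑y² = 264

Now, substitute all the values in the formula
r = (n∑xy − ∑x∑y) / √([n∑x² − (∑x)²][n∑y² − (∑y)²]):

r = (5 × 684 − 93 × 32) / √([5 × 1863 − 93²][5 × 264 − 32²])
  = (3420 − 2976) / √(666 × 296)
  = 444 / 444

We have r = 1, a perfect positive correlation: every point lies on the line y = (2/3)x − 6.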
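
The arithmetic of this example can be checked in a few lines of Python, working only from the x and y rows of the table. The sketch assumes the usual raw-sum form of Pearson's formula, r = (n∑xy − ∑x∑y) / √([n∑x² − (∑x)²][n∑y² − (∑y)²]).

```python
from math import sqrt

x = [12, 15, 18, 21, 27]
y = [2, 4, 6, 8, 12]
n = len(x)

xy = [a * b for a, b in zip(x, y)]   # note that 18 * 6 contributes 108
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xy)
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)   # 93 32 684 1863 264
print(r)                                      # 1.0
```

The result is exactly 1 because the five points are perfectly linear: each y equals (2/3)x − 6.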
