Data Analysis and Report Writing (BRM)
Data Analysis
• Data analysis is the process of cleaning, transforming, and processing raw
data to extract actionable, relevant information that helps businesses
make informed decisions.
• The process reduces the risks inherent in decision-making by providing
useful insights and statistics, often presented as charts, images, tables,
and graphs.
Importance
• Better Customer Targeting
• Reduced Operational Costs
• Better Problem-Solving Methods
• Greater Accuracy
Process
• Data Collection
• Data Cleaning
• Data Analysis
• Data Interpretation
• Data Visualization
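The sketch below illustrates these steps with pandas; it is a minimal sketch, not a prescribed workflow. The file name survey.csv and the column names are hypothetical, and matplotlib is assumed installed for the final plot.

```python
# A minimal pandas sketch of the pipeline above. The file name "survey.csv"
# and the column names are hypothetical.
import pandas as pd

df = pd.read_csv("survey.csv")                    # data collection: load raw data
df = df.drop_duplicates()                         # data cleaning: remove duplicate records
df = df.dropna(subset=["age", "income"])          # data cleaning: drop rows missing key fields
summary = df.groupby("region")["income"].mean()   # data analysis: mean income by region
print(summary)                                    # data interpretation starts from this output
summary.plot(kind="bar", title="Mean income by region")  # data visualization
```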
Data Editing
• Data editing is the application of checks to detect missing, invalid or
inconsistent entries or to point to data records that are potentially in
error.
• Data editing is carried out for the following reasons:
• A respondent could have misunderstood a question.
• A respondent or an interviewer could have checked the wrong response.
• A coder could have miscoded or misunderstood a written response.
• An interviewer could have forgotten to ask a question or to record the answer.
• A respondent could have provided inaccurate responses.
• Some questions have been left blank.
Types of Data Editing
• Validity Edits
• Duplication Edits
• Consistency Edits
• Historical Edits
• Statistical Edits
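A minimal Python sketch of how validity, duplication, and consistency edits might be applied with pandas; the fields, values, and thresholds here are hypothetical.

```python
import pandas as pd

# Hypothetical survey records; the fields and thresholds are illustrative.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "age": [25, 130, 130, 17],
    "marital_status": ["Single", "Married", "Married", "Married"],
})

# Validity edit: flag ages outside a plausible range.
invalid_age = ~df["age"].between(0, 110)

# Duplication edit: flag records whose identifier already appeared.
duplicate_id = df["id"].duplicated(keep="first")

# Consistency edit: a respondent under 18 recorded as married is suspicious.
inconsistent = (df["age"] < 18) & (df["marital_status"] == "Married")

# Point to the records that are potentially in error.
print(df[invalid_age | duplicate_id | inconsistent])
```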
Data Coding
• Data coding is the process of assigning numerical values to responses
originally recorded in formats such as numbers, text, audio, or video. The
main objective is to facilitate the automatic treatment of data for
analytical purposes.
• Data coding is the process of converting data into a form that can be
analyzed. It involves assigning numerical or categorical codes to data
items, such as responses to survey questions or demographic
information. Coded data can then be analyzed using statistical
software or other tools.
Types of Data Coding
• Nominal coding: This involves assigning labels or categories, with no
inherent order, to data items. For example, responses to a survey question
about marital status might be coded as follows: Single = 1, Married = 2,
Divorced = 3, Widowed = 4.
• Ordinal coding: This involves assigning categories to data items in a
specific order. For example, responses to a survey question about
satisfaction level might be coded as follows: Very dissatisfied = 1,
Dissatisfied = 2, Neutral = 3, Satisfied = 4, and Very satisfied = 5.
• Dichotomous coding: This involves assigning a binary code (e.g., 0 or 1)
to data items. For example, responses to a survey question about
gender might be coded as follows: Male = 0, Female = 1.
• Numeric coding: This involves assigning numerical values to data items.
For example, responses to a survey question about age might be coded
as follows: 18-24 years old = 1, 25-34 years old = 2, 35-44 years old = 3,
and so on.
• Derived variables: This involves calculating new variables based on
existing data. For example, a researcher might calculate the mean score
for a set of survey questions or create a new variable based on the sum
of several other variables.
• Truncation: This involves removing part of a data item. For example, a
researcher might truncate a variable by recording only part of each value
(e.g. recording 23 and 47 for the measured values 12.23 and 12.47). This
can be helpful when the analysis is being performed manually.
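A short Python sketch of nominal and ordinal coding, reusing the marital-status and satisfaction examples above; the response strings and the pandas usage are illustrative.

```python
import pandas as pd

# Hypothetical raw survey responses.
responses = pd.DataFrame({
    "marital_status": ["Single", "Married", "Divorced", "Widowed", "Married"],
    "satisfaction": ["Neutral", "Very satisfied", "Dissatisfied",
                     "Satisfied", "Very dissatisfied"],
})

# Nominal coding: arbitrary numeric labels for unordered categories.
marital_codes = {"Single": 1, "Married": 2, "Divorced": 3, "Widowed": 4}
responses["marital_code"] = responses["marital_status"].map(marital_codes)

# Ordinal coding: codes that preserve the order of the categories.
satisfaction_codes = {"Very dissatisfied": 1, "Dissatisfied": 2, "Neutral": 3,
                      "Satisfied": 4, "Very satisfied": 5}
responses["satisfaction_code"] = responses["satisfaction"].map(satisfaction_codes)

print(responses)
```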
Tabular representation of data
• Tabulation: The systematic presentation of numerical data in rows and
columns is known as tabulation. It is designed to make presentation
simpler and analysis easier. This type of presentation facilitates
comparison by putting relevant information close together, and it helps
in further statistical analysis and interpretation.
Objectives
• To make complex data simpler: The main aim of tabulation is to
present the classified data in a systematic way. The purpose is to
condense the bulk of information (data) under investigation into a
simple and meaningful form.
• To save space: Tabulation tries to save space by condensing data in a
meaningful form while maintaining the quality and quantity of the
data.
• To facilitate comparison: It also aims to facilitate quick comparison of
various observations by providing the data in a tabular form.
• To facilitate statistical analysis: Tabulation aims to facilitate statistical
analysis because it is the stage between data classification and data
presentation. Various statistical measures, including averages,
dispersion, correlation, and others, are easily calculated from data
that has been systematically tabulated.
• To provide a reference: Since data may be easily identifiable and
used when organised in tables with titles and table numbers,
tabulation aims to provide a reference for future studies.
Frequency Tables
• Frequency means the number of times a value appears in the data. A
table can quickly show us how many times each value appears.
• If the data has many different values, it is easier to use intervals of
values to present them in a table.
Here are the ages of the 934 Nobel Prize winners up until the year 2020.
In the table, each row is an age interval of 10 years.

Age Interval   Frequency
10-19                  1
20-29                  2
30-39                 48
40-49                158
50-59                236
60-69                262
70-79                174
80-89                 50
90-99                  3
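A minimal pandas sketch of building such an interval frequency table; the ages listed are hypothetical stand-ins for the full dataset of 934 laureates.

```python
import pandas as pd

# Hypothetical sample of laureate ages; the slide's table summarizes all 934.
ages = pd.Series([36, 44, 52, 58, 61, 63, 67, 70, 75, 88])

# 10-year intervals: [10, 20), [20, 30), ..., [90, 100).
bins = list(range(10, 101, 10))
labels = [f"{b}-{b + 9}" for b in bins[:-1]]

# Bin each age into its interval and count the frequency per interval.
intervals = pd.cut(ages, bins=bins, right=False, labels=labels)
print(intervals.value_counts().sort_index())
```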
Univariate Analysis
• Univariate analysis is the most basic kind of statistical analysis
technique. The data contains just one variable, so the analysis does not
deal with causes, effects, or relationships between variables.
• Example: a survey of the heights of the students in a class.
Mean, Median, Mode
• Mean: The mean is calculated by adding up a group of numbers and
then dividing the sum by the count of those numbers.
• Median: The median is the middle value in a group of numbers arranged in
ascending or descending order, i.e. half the numbers are greater than the
median and half the numbers are less than the median.
• Mode: The mode is the most frequently occurring value in the dataset.
While the mean and median require some calculation, the mode can be found
simply by counting the number of times each value occurs.
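A quick illustration with Python's built-in statistics module, using hypothetical data.

```python
import statistics

data = [3, 7, 7, 2, 9, 7, 4]    # hypothetical values

print(statistics.mean(data))    # sum / count = 39 / 7 ≈ 5.57
print(statistics.median(data))  # middle value of the sorted data = 7
print(statistics.mode(data))    # most frequent value = 7
```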
Standard Deviation
• Standard deviation is a useful measure of spread for normal distributions.
• In normal distributions, data is symmetrically distributed with no skew. Most
values cluster around a central region, with values tapering off as they go
further away from the center. The standard deviation tells you how spread
out from the center of the distribution your data is on average.
• In an illustration comparing several groups whose rating distributions
have the same mean (M), the mean is the value on the x-axis where each
curve peaks; the groups' standard deviations (SD), however, differ from
each other.
• The standard deviation reflects the dispersion of the distribution. The
curve with the lowest standard deviation has a high peak and a small
spread, while the curve with the highest standard deviation is flatter
and more widespread.
The Empirical Rule
• The standard deviation and the mean together can tell you where
most of the values in your frequency distribution lie if they follow a
normal distribution.
• The empirical rule, or the 68-95-99.7 rule, tells you where your values
lie:
• Around 68% of scores are within 1 standard deviation of the mean,
• Around 95% of scores are within 2 standard deviations of the mean,
• Around 99.7% of scores are within 3 standard deviations of the mean.
Formula
• For a population of N values x₁, …, x_N with mean μ, the standard
deviation is σ = √( Σ(xᵢ - μ)² / N ); for a sample, divide by n - 1
instead of N.
• A worked example is given in Calculation of standard deviation.xlsx.
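A minimal Python check of the formula above, on hypothetical data; pstdev computes the population standard deviation and stdev the sample version.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values

# Manual population standard deviation: square root of the mean squared deviation.
mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / len(data)
print(math.sqrt(variance))        # 2.0

# The statistics module gives the same results directly.
print(statistics.pstdev(data))    # population SD = 2.0
print(statistics.stdev(data))     # sample SD (divides by n - 1) ≈ 2.14
```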
Bivariate Analysis
• Bivariate analyses are conducted to determine whether a statistical
association exists between two variables, the degree of association if
one does exist, and whether one variable can be predicted from the other.
For example, a bivariate analysis could be used to answer the question of
whether there is an association between income and quality of life, or
whether quality of life can be predicted from income.
Cross Tabulation
• A cross tabulation (or crosstab) report is used to analyze the
relationship between two or more variables. The report has the x-axis
as one variable (or question) and the y-axis as another variable. This
type of analysis is crucial in finding underlying relationships within
your survey results. (or any type of data!)
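A minimal sketch with pandas.crosstab; the gender/response survey data below is hypothetical.

```python
import pandas as pd

# Hypothetical survey responses.
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M", "F", "M"],
    "response": ["Yes", "Yes", "No", "No", "Yes", "Yes", "No", "Yes"],
})

# Cross tabulation: counts for each (gender, response) combination.
print(pd.crosstab(df["gender"], df["response"]))
```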
E.g. Spearman's Rank Correlation Coefficient (worked out in Spearman's
Rank Correlation Coefficient.xlsx)
• We have paired measurements from a number of sites, as given in the
worksheet.
• Rank the two data sets. Ranking is achieved by giving rank '1' to the
biggest number in a column, '2' to the second biggest value, and so on;
the smallest value in the column gets the lowest rank. This should be
done for both sets of measurements.
• Find Σd² by adding up all the values in the Difference² column.
• Multiply this by 6.
• Now for the bottom line of the equation: n is the number of sites at
which you took measurements, which in our example is 8.
• Substituting into n³ - n gives 8³ - 8 = 504.
• We now have the formula: Rs = 1 - (6Σd²) / (n³ - n).
Interpretation
• The closer Rs is to +1 or -1, the stronger the likely correlation. A
perfect positive correlation is +1 and a perfect negative correlation is
-1. The Rs value of -0.73 suggests a fairly strong negative relationship.
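A sketch of the calculation in Python, following the slide's steps and then cross-checking with scipy (assumed installed); the eight (x, y) pairs are hypothetical, not the worksheet's data.

```python
from scipy import stats

# Hypothetical paired measurements from 8 sites (matching the slide's n = 8).
x = [12, 18, 24, 30, 36, 42, 48, 54]
y = [9.1, 8.4, 7.9, 7.2, 6.8, 7.5, 6.1, 5.3]

def ranks(values):
    # Rank 1 for the smallest value (no ties in this toy data). Rs is the
    # same as long as both columns use one consistent ranking convention.
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

# Sum of squared rank differences, then the slide's formula.
d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
n = len(x)
rs = 1 - 6 * d2 / (n ** 3 - n)
print(rs)

# scipy agrees (and also handles ties properly).
rho, p = stats.spearmanr(x, y)
print(rho, p)
```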
Practice Problems
• Using the actual mean method, calculate the standard deviation for the
data 3, 2, 5, and 6.
• Determine the standard deviation of the first 5 natural numbers.
• Compute Pearson's coefficient of correlation between advertisement cost
(in 1000s) and sales (in lakhs) as per the data given below:
  39 65 62 90 82 75 25 98 36 78
  47 53 58 86 62 68 60 91 51 84
• The following data relates to the yield in grams (y) and the matured
pods (x) of 10 groundnut plants. Work out the correlation coefficient:
  14 34 20 16 11 11 20 17 22 17
  16 40 21 18 14 13 20 35 17 27
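As a hint for the third problem, a minimal scipy sketch; it assumes the first data row above is the advertisement cost and the second is sales (the correlation is the same either way, since Pearson's r is symmetric).

```python
from scipy import stats

# The slide's practice data, assuming the first row is advertisement cost
# (in 1000s) and the second row is sales (in lakhs).
cost = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]
sales = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]

r, p = stats.pearsonr(cost, sales)
print(f"Pearson r = {r:.3f}, p-value = {p:.4f}")
```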
Chi-Square Test
• Chi-square is a statistical method that measures the difference between
observed and expected data values. It is used to find out how closely the
actual data fit the expected data. The chi-square value tells us whether
the difference between expected and observed data is statistically
significant. A small chi-square value indicates that any differences
between actual and expected data are due to chance alone, and hence the
difference is not statistically significant.
• A large value indicates that the difference is statistically significant
and that something is causing the differences in the data.
Formula
• χ² = Σ (O - E)² / E, where O is an observed frequency and E is the
corresponding expected frequency.
While calculating (O - E), the following hints should be considered:
• The sum of these differences always equals zero in each column.
• Each difference for sample A is matched by the same figure, but with the
opposite sign, for sample B.
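A minimal Python illustration of the formula, using hypothetical observed counts for 100 rolls of a die and cross-checking with scipy.

```python
from scipy import stats

# Hypothetical observed counts for 100 rolls of a die, against equal
# expected counts (note sum of O equals sum of E).
observed = [18, 22, 16, 14, 12, 18]
expected = [100 / 6] * 6

# Manual chi-square statistic: sum of (O - E)^2 / E over all categories.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi2)

# scipy gives the same statistic plus its p-value (df = 6 - 1 = 5 here).
stat, p = stats.chisquare(observed, expected)
print(stat, p)
```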
Degree of freedom
• Degrees of freedom refer to the maximum number of logically
independent values, which may vary in a data sample.
• Degrees of freedom are calculated by subtracting one from the
number of items within the data sample.
• Consider a data sample consisting of five positive integers whose values
must have an average of six. If four items within the data set are
{3, 8, 5, 4}, the fifth number must be 10. Because the first four numbers
can be chosen at random, the degrees of freedom are four.
Level of significance
• The level of significance is defined as the fixed probability of wrongly
rejecting the null hypothesis when it is in fact true (a Type I error).
• To measure the statistical significance of a result, the investigator
first needs to calculate the p-value: the probability of observing an
effect at least as large as the one found, given that the null hypothesis
is true.
• When the p-value is less than the level of significance (α), the null
hypothesis is rejected.
• If the observed p-value is not less than the significance level α, then
in theory the null hypothesis is accepted (we fail to reject it). In
practice, we often increase the sample size and check whether the
significance level is reached.
The general interpretation of the p-value, based on a significance level
of 10%:
• If p > 0.1, there is no evidence against the null hypothesis.
• If 0.05 < p ≤ 0.1, there is weak evidence against the null hypothesis.
• If 0.01 < p ≤ 0.05, there is strong evidence against the null hypothesis.
• If p ≤ 0.01, there is very strong evidence against the null hypothesis.
• The smaller the p-value, the stronger the evidence against the null
hypothesis and the more likely the result is statistically significant;
hence, rejecting the null hypothesis becomes increasingly justified as
the p-value gets smaller.
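A small Python helper that encodes the bands above; it is only a sketch of this interpretation rule, not a standard library function.

```python
def interpret_p(p):
    """Map a p-value onto the evidence bands above (significance level 10%)."""
    if p > 0.1:
        return "no evidence against the null hypothesis"
    if p > 0.05:   # 0.05 < p <= 0.1
        return "weak evidence against the null hypothesis"
    if p > 0.01:   # 0.01 < p <= 0.05
        return "strong evidence against the null hypothesis"
    return "very strong evidence against the null hypothesis"

for p in (0.20, 0.08, 0.03, 0.004):
    print(p, "->", interpret_p(p))
```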
Linear Regression Analysis
• Linear regression analysis is used to predict the value of a variable
based on the value of another variable. The variable you want to
predict is called the dependent variable. The variable you are using to
predict the other variable's value is called the independent variable.
• Linear regression fits a straight line or surface that minimizes the
discrepancies between predicted and actual output values.
• Linear Regression is useful for making future predictions by following
the current trends (predictive analysis)
Regression Coefficient
• A regression coefficient is used to find the value of an unknown
variable when the value of another variable is known.
• Linear regression, an important type of regression, is used when we want
to quantify how a unit change in the independent variable affects the
dependent variable, by determining the equation of a straight line. This
is known as regression analysis.
• In linear regression, the main aim is to find the equation of a straight
line that best describes the relationship between two or more
variables.
Interpretation of Regression Coefficients
• In order to make certain predictions about the unknown variable, we first
need to understand the nature of regression coefficients. This nature of
regression coefficients will help us check the extent of change in a dependent
variable with the effect of a unit change in the independent variable.
• Here are the regression coefficient’s interpretations:
• A positive sign of the regression coefficient explains a direct relationship
between the variables. This means that with an increase in the independent
variable, the dependent variable also increases, and vice versa.
• A negative sign of the regression coefficient explains an inverse relationship
between the variables. This means that with an increase in the independent
variable, the dependent variable also decreases, and vice versa.
Using the formula to find the regression coefficients: for a fitted line y = a + bx, the slope is b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and the intercept is a = ȳ - b·x̄. See Steps to calculate linear regression.xlsx for a worked example.
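A minimal Python sketch applying these formulas to hypothetical data, cross-checked against scipy.stats.linregress.

```python
from scipy import stats

# Hypothetical (x, y) pairs with a roughly linear trend.
x = [1, 2, 3, 4, 5]
y = [2.1, 4.0, 6.2, 7.9, 10.1]

# Manual least-squares coefficients:
# b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),  a = y_bar - b * x_bar
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar
print(f"y = {a:.3f} + {b:.3f}x")

# scipy.stats.linregress returns the same slope and intercept.
result = stats.linregress(x, y)
print(result.slope, result.intercept)
```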
Tests of Significance
• T-Test: A t-test is a statistical test that is used to compare the means
of two groups. It is often used in hypothesis testing to determine
whether a process or treatment actually has an effect on the
population of interest, or whether two groups are different from one
another.
• F-Test: An F-test is a statistical test used in hypothesis testing to
check whether the variances of two populations or two samples are equal.
The test statistic follows an F-distribution; the F-statistic compares
two variances by dividing one by the other.
• Z-Test: A z-test is a statistical test to determine whether two
population means are different when the variances are known and
the sample size is large.
• A z-test is a hypothesis test in which the z-statistic follows a normal
distribution.
• A z-statistic, or z-score, is a number representing the result from the
z-test.
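A short scipy sketch of a two-sample t-test, plus a variance ratio in the spirit of the F-test; the group scores are hypothetical.

```python
import statistics
from scipy import stats

# Hypothetical scores for two independent groups.
group_a = [82, 75, 90, 68, 77, 85, 80]
group_b = [70, 65, 72, 74, 68, 71, 66]

# Two-sample t-test: do the group means differ?
t_stat, p = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p:.4f}")

# F-test idea: compare the two sample variances via their ratio.
f_stat = statistics.variance(group_a) / statistics.variance(group_b)
print(f"F = {f_stat:.3f}")
```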
Non-Parametric Tests and ANOVA
• Binomial Test of Proportion: The Binomial test, sometimes referred to
as the Binomial exact test, is a test used in sampling statistics to assess
whether a proportion of a binary variable is equal to some
hypothesized value.
• ANOVA:
• One Way: One-way ANOVA is a hypothesis test that allows one to make
comparisons between the means of three or more groups of data.
• Two Way: Two-way ANOVA is a hypothesis test that allows one to make
comparisons between the means of three or more groups of data,
where two independent variables are considered.
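A minimal scipy sketch of a one-way ANOVA on three hypothetical groups.

```python
from scipy import stats

# Hypothetical scores from three independent groups.
g1 = [85, 86, 88, 75, 78]
g2 = [81, 83, 87, 80, 79]
g3 = [90, 92, 94, 89, 91]

# One-way ANOVA: do the three group means differ?
f_stat, p = stats.f_oneway(g1, g2, g3)
print(f"F = {f_stat:.3f}, p = {p:.4f}")
```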