BADM Material
Course Description
This course addresses a pressing need in today's business world, where Business Analytics has
become a buzzword. According to the India Industry report 2021, the business analytics market is
expected to grow by 21.2% by 2026 (Xpert, 2023).
Data Visualization
Data Visualization: Types of Presentation of Data – Graphical Presentation – Scatter Plot,
Histogram; Diagrammatic Presentation – One-Dimensional – Bar Charts – Simple, Sub-divided
and Multiple; Two-Dimensional – Pie Charts (2D and 3D); Other Charts – Box Plots,
Line Plots – using R Graphics and R Commander/R Deducer.
Table of Contents
Module 1
Module 2
Module 3
Module 4
Data Visualization in R
Unit 4.1 Introduction to Data Visualization
Unit 4.2 Graphical Presentation of Data
Unit 4.3 Diagrammatic Presentation of Data
Introduction to Business Analytics
MODULE – I
Why?
What is Business? A profit-oriented activity.
What is Analytics? A set of statistical tools.
What is Business Analytics? Analyzing business issues using different
statistical tools.
In R, these measures are computed with range(), max(), min(), sd() and var().
For dt, the normality-test p-value is 0.7455 > 0.05, so we fail to reject the NH: the data is normal.
In the entire process of hypothesis testing, there is also an attempt to
reduce errors. Conventionally, there are two types, namely Type I Error
and Type II Error; the idea was later extended conceptually to
Type III and Type IV Errors. They are as follows;
Types of Errors in Hypothesis Testing
1. Type I Error – Alpha – the Level of Significance (also known as
Producer's Risk)
2. Type II Error – Beta (1 - Beta => Power of the Test)
3. Type III Error – Wrongly framing the hypothesis
4. Type IV Error – Wrongly interpreting the results
              | Ho is True            | Ho is False
--------------+-----------------------+-----------------------------
Accept H0     | Correct Decision      | Type II Error (Beta)
Reject H0     | Type I Error (Alpha)  | Correct Decision (1 - Beta)
              |                       | = Power of the Test
Correct Hypotheses:: NH: Data is Normal
AH: Data is not Normal
dt<-c(10,11,12,15,14,16,14,17)
This section deals with the applications of Business Analytics in various domains, namely
Marketing, Finance, Human Resources and Operations. They are as follows;
1. Marketing (CCDV): Describing demographics
   Customer Lifetime Value (CLTV)
   Cluster Analysis – Market Segmentation
   Conjoint Analysis – New Product Development
   (Control (2), Weight (3), Price (3), Color (4), PC (2))
   Market Basket Analysis – Sales, Cross-Selling
   Demand/Sales Forecasting – Trend Analysis
2. Finance: Share price movements – ARIMA/ARCH/GARCH models
   ARCH – AutoRegressive Conditional Heteroscedasticity, used to model
   volatility – Engle (a Nobel laureate)
   GARCH – Generalized ARCH models – Bollerslev
   Variants: SGARCH, TGARCH (Threshold), MGARCH, APGARCH (Asymmetric Power),
   GJR-GARCH, FGARCH...
   Used for prediction and for risk assessment
   Standard deviation for choosing the best project – A vs B
3. Human Resources: Employee attrition; employee performance;
   employee empowerment; emotional intelligence;
   performance appraisal
4. Operations: Raw material procurement
   Logistics – Transportation cost
   Linear Programming Problem (LPP) – for maximum profit from
   Mugs (x1) and Jugs (x2):
   Max. Z = 3x1 + 4x2  -- Objective function
   Subject to constraints:
   2x1 + 5x2 <= 40  -- Raw material constraint
   3x1 + 3x2 <= 50  -- Labor hour constraint
   where x1, x2 >= 0
   Quality Control Charts (QCC) – to measure product quality
   QCC for variables
   QCC for attributes
   (qcc package in R)
   Job sequencing problems
   Assignment problems
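The LPP formulated above (Max Z = 3x1 + 4x2 subject to the two constraints) can be solved in R. A minimal sketch, assuming the lpSolve package (not mentioned in the text) is installed:

```r
# Sketch: solving the mugs-and-jugs LPP with the lpSolve package
# (assumes install.packages("lpSolve") has been run)
library(lpSolve)

obj <- c(3, 4)                       # Max Z = 3x1 + 4x2
con <- matrix(c(2, 5,                # raw material: 2x1 + 5x2 <= 40
                3, 3),               # labor hours:  3x1 + 3x2 <= 50
              nrow = 2, byrow = TRUE)
sol <- lp("max", obj, con, c("<=", "<="), c(40, 50))

sol$solution   # optimal production mix (x1, x2)
sol$objval     # maximum profit Z
```

For these numbers the optimum works out to roughly x1 ≈ 14.44 and x2 ≈ 2.22, with Z ≈ 52.22; in practice whole-unit production would be enforced with the all.int = TRUE argument.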
BA is playing a vital role in all these domains as well as across industries
such as healthcare, FMCG, agriculture, IT, logistics, textiles
and retail, to name a few.
i. for loop
Syntax: for (var in sequence) {
Statement(s)
}
Example::
> # print 1 to 5 numbers
> for(i in 1:5){
+ print(paste("The number is",i))
+}
[1] "The number is 1"
[1] "The number is 2"
[1] "The number is 3"
[1] "The number is 4"
[1] "The number is 5"
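The summary at the end of this module also lists while and repeat loops; they follow the same pattern as the for loop above (a minimal sketch):

```r
# while loop: repeats as long as the condition holds
i <- 1
while (i <= 5) {
  print(paste("The number is", i))
  i <- i + 1
}

# repeat loop: runs until an explicit break
i <- 1
repeat {
  print(paste("The number is", i))
  i <- i + 1
  if (i > 5) break
}
```

Both loops print the numbers 1 to 5, exactly like the for-loop example above.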
Summary
Analyzing data with a set of statistical tools is referred to as Business Analytics.
There are five types of analytics, namely Descriptive Analytics, Diagnostic Analytics,
Predictive Analytics, Prescriptive Analytics and Cognitive Analytics.
Basic Analytics includes the first two types, and the rest come under
Advanced Business Analytics.
Descriptive Analytics helps in describing the behavior of data, and Diagnostic
Analytics helps in knowing the reasons for a situation.
The module explains the data types (numeric, logical, character, complex, integer and raw),
data operators (assignment, arithmetic, relational, logical), data structures (scalar,
vector, matrix, array, list, dataframe, factor), functions, conditional statements (if, if-else,
nested if-else, switch) and looping statements (for, while and repeat) in R.
Finally, a table has been provided to understand how to choose a test based on the
type of variable and normality.
Terminal Questions
1. Explain the significance of Business Analytics.
2. Elaborate the types of Business Analytics.
3. Illustrate examples for data types in R.
4. Illustrate examples for data operators in R.
5. Illustrate examples for data structures in R.
6. Distinguish user-defined functions from in-built functions in R.
7. Provide a chart for identifying the test based on variables and normality.
Activity
Glossary
Analytics: Analyzing data using a set of statistical tools.
Business Analytics: Analyzing business data using a set of statistical tools.
Descriptive Analytics: Analytics that helps in describing the behavior of data.
Diagnostic Analytics: Analytics that helps in exploring the reasons behind the behavior of data.
Data Type: Expresses the nature of an R object.
Data Structure: Describes the organization of data in an R object.
Function: A named set of statements.
Bibliography
Xpert, A. (2023, March 10). Admissions Xpert. Retrieved March 10, 2023, from
https://admissionxpert.in/careers-in-business-analytics/
e-References
https://www.simplilearn.com/types-of-business-analytics-tools-examples-jobs-article
https://www.statmethods.net/r-tutorial/index.html
https://www.datacamp.com/tutorial/data-types-in-r
https://www.w3schools.com/r/default.asp
Video links
Topic | Link
Business Analytics & R | https://www.youtube.com/watch?v=eDrhZb2onWY&list=PL9ooVrP1hQOEIUTpxRf4infBJnquwaTME
Basics of R | https://www.youtube.com/watch?v=5hqm1ItZjyo
MODULE – II
In descriptive analytics, there are four measures (Cooksey, 2020) that help in analyzing the
behavior of data. They are,
i. Measures of Central Tendency (MOCT)
ii. Measures of Dispersion (MOD)
iii. Measures of Skewness (MOsk.)
iv. Measures of Kurtosis (MOku)
Example:
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
> # Extract mpg from mtcars dataset
> # Install DescTools package to compute Mean(), Median() and Mode()
> attach(mtcars) # attach variables of mtcars dataframe to the R Environment
> library(DescTools)
Warning message:
package ‘DescTools’ was built under R version 4.2.2
> Mean(wt)
[1] 3.21725 # Average weight of the 32 cars is 3.21725 (1000 lbs; 1 lb = 0.45 kg)
> Median(wt)
[1] 3.325
# 50% of the cars have a weight of 3.325 (1000 lbs) or below.
> Mode(wt)
[1] 3.44
attr(,"freq")
[1] 3
# The most frequent weight is 3.44 (1000 lbs), occurring for 3 cars.
From the above mean, median and mode results, the distribution appears to be negatively
skewed, since mean < median. However, the two values are close to each other, so negative
skewness cannot be asserted with confidence.
For example, relative measures compare the heights of children measured
in centimetres and in feet better than absolute measures can.
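That unit-free comparison is exactly what the coefficient of variation (a relative measure, sd/mean) provides. A quick sketch with hypothetical heights:

```r
# Coefficient of variation (CV = sd/mean) is unit-free, so the same
# heights measured in cm and in feet give identical CVs.
h_cm <- c(120, 135, 150, 142, 128)     # hypothetical heights in cm
h_ft <- h_cm / 30.48                   # the same heights in feet

cv <- function(x) sd(x) / mean(x)
cv(h_cm)   # relative spread in cm
cv(h_ft)   # identical value: the units cancel out
```

An absolute measure such as sd() would differ by a factor of 30.48 between the two vectors, while the CV is the same for both.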
In this session, few measures like Range, Quartiles, Standard Deviation and
variance are used in R.
Example:
> # MOD
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
> range(mpg)
[1] 10.4 33.9
> rg<-max(mpg)-min(mpg)
> rg
[1] 23.5
> quantile(mpg,0.20)
20%
15.2
> quantile(mpg,0.50)
50%
19.2
> quantile(mpg,0.80)
80%
24.08
> # Standard Deviation of Projects A and B
> # Let us assume xyz company has to take up one project among A and B, with 4-year returns
> pa<-c(10,5,35,10)
> sum(pa)
[1] 60
> pb<-c(15,20,10,15)
> # Initially, to know which project to accept, the mean of the two projects is computed as follows;
> mean(pa)
[1] 15
> mean(pb)
[1] 15
> # To better identify which project to select use standard deviation(sd)
> sd(pa)
[1] 13.54006
> sd(pb)
[1] 4.082483
> # Of the two projects pa and pb, pb is chosen, being more consistent in its returns.
- This measure helps in understanding whether the distribution of data is positively
or negatively skewed. Theoretically, if skewness is zero, the data is normally
distributed. In practice, skewness is rarely exactly zero, so by convention the data
is considered approximately normal if the skewness value lies between -1 and +1
(or, more strictly, between -0.5 and +0.5).
Example in R
> # Skewness
> library(moments)
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
> # check the skewness of mpg
> # Attach the variables of mtcars to the environment using attach()
> attach(mtcars)
> skewness(mpg)
[1] 0.6404399
> # Above result is positive, means positively skewed
> # To test its statistical significance, use D'Agostino test
> # NH: Data have no Skewness
> # AH: Data have a Skewness
> agostino.test(mpg)
data: mpg
skew = 0.64044, z = 1.63510, p-value = 0.102
alternative hypothesis: data have a skewness
> # From the above result, the p-value is 0.102 > 0.05, so we fail to reject the NH.
> # It is concluded that the data has no significant skewness, i.e., the variable mpg
is normal.
This is the last measure that helps in understanding the distribution of data.
Kurtosis describes the peakedness of the curve. There are
three types of kurtosis. They are,
1. Platykurtic – where k < 3 (Kurtosis) or < 0 (Excess Kurtosis)
2. Mesokurtic – where k = 3 (Kurtosis) or = 0 (Excess Kurtosis)
3. Leptokurtic – where k > 3 (Kurtosis) or > 0 (Excess Kurtosis)
Usually, the requirement is that the data should be mesokurtic to be normally
distributed. In practice, the value of kurtosis is rarely exactly 3, so statisticians
allow a relaxation: if k is around 3 (commonly, excess kurtosis within about ±2),
the data can be considered approximately normal.
Example:
> # Kurtosis
> library(moments)
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
> attach(mtcars)
> # let us compute kurtosis for mpg
> kurtosis(mpg)
[1] 2.799467
> # From the above value, it is evident that the kurtosis is 2.79, which is close to 3.
> # This confirms that the data is mesokurtic, meaning mpg appears to be normal.
> # To confirm whether this result is statistically significant, apply the Anscombe-Glynn test.
> anscombe.test(mpg)
data: mpg
kurt = 2.79947, z = 0.20148, p-value = 0.8403
alternative hypothesis: kurtosis is not equal to 3
> # As the p-value is 0.8403 > 0.05, we fail to reject the NH that kurtosis equals 3: mpg is consistent with normality.
In this unit, the concepts cover hypothesis testing, since diagnosis is performed on the
population to draw conclusions based on a sample. As part of Diagnostic Analytics, this
covers inferential statistics along with descriptive analytics. In hypothesis testing,
different data come with different levels of measurement, namely nominal, ordinal and
scale (interval or ratio), and statistical tests are developed to match them. As discussed
earlier, there are four levels of measurement (Morris & Sheedy, 2022). They are as follows;
1. Nominal
2. Ordinal
3. Interval
4. Ratio
Usually, nominal data is used for identification, categorization or classification.
Some examples are gender, marital status, colors and names of places. In such data, the
categories are not comparable. The moment the categories become comparable, the data
becomes ordinal, as in educational qualification, where PG students are more qualified
than UG students. Therefore, ordinal data is also referred to as ordered categorical or
ranked data. Further, if the distance between these categories is the same, the data is
termed interval, but without an absolute zero, as in temperature scales other than
Kelvin. Finally, data with an absolute zero comes under ratio data, such as income,
sales, profits and marks.
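In R, nominal and ordinal data map to unordered and ordered factors respectively. A minimal sketch (the values are illustrative):

```r
# Nominal: categories with no order
gender <- factor(c("male", "female", "female", "male"))
is.ordered(gender)          # FALSE: a plain (nominal) factor

# Ordinal: ordered categories (UG < PG)
qual <- factor(c("UG", "PG", "UG"), levels = c("UG", "PG"), ordered = TRUE)
is.ordered(qual)            # TRUE
qual[2] > qual[1]           # TRUE: PG ranks above UG
```

Comparison operators work only on ordered factors, mirroring the distinction that nominal categories are not comparable while ordinal ones are.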
At present, the focus is on nominal data, whose tests are covered in the following unit.
Nominal tests deal with categorical data; analyzing categorical data is their basic
purpose. These nominal tests are classified as follows;
1. Nominal test for one sample group
2. Nominal test for two sample groups
3. Nominal test for more than two sample groups
4. Nominal test for paired response
5. Nominal test for repeated response
6. Nominal test for measuring the relationship between two variables
Examples 1:
> # Binomial Test - Used to test whether the proportions of a variable with 2 categories differ significantly.
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
> dt<-table(mtcars$vs) # Extracting vs from mtcars dataset using $ symbol
> dt
0 1
18 14
> # NH: 50% of the cars have a V-shaped engine.
> # AH: The proportion of cars with a V-shaped engine is not 50%.
> binom.test(dt)
data: dt
number of successes = 18, number of trials = 32, p-value = 0.5966
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.3766257 0.7363619
sample estimates:
probability of success
0.5625
> # As the p-value is 0.5966 > 0.05, the result fails to reject the NH: about 50% of
the cars have a V-shaped engine.
> # In the sample, 56.25% of the cars are V-shaped, i.e., near 50%.
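The next output (for dtf, whose construction is not shown here) can be reproduced by passing the counts to binom.test() directly — a sketch assuming 12 passes out of 15 students:

```r
# Binomial test with counts given directly:
# 12 successes (pass) out of 15 trials (students)
binom.test(12, 15)
# p-value = 0.03516, estimated probability of success = 0.8
```

This matches the dtf output below (successes = 12, trials = 15), without needing the underlying pass/fail vector.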
data: dtf
number of successes = 12, number of trials = 15, p-value = 0.03516
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.5191089 0.9566880
sample estimates:
probability of success
0.8
> # As the p-value is 0.03516 < 0.05, we reject the NH and accept the AH.
> # This indicates that the proportion of pass is not equal to 50%; the probability of
success (pass %) is 80%.
Suppose we extend the same example by adding detained students to the list of 15,
making it 20 students with 3 categories, namely 'pass', 'fail' and 'detained'.
The suitable test for this purpose is the multinomial test, via multinom_test() from
the rstatix package.
> # Example
> ps<-c('pass','pass','fail','fail','pass','pass','pass','fail','pass','pass','pass','pass',
'pass', 'pass','pass','detained','detained','pass','pass','pass')
> length(ps)
[1] 20
> dt<-table(ps)
> dt
ps
detained fail pass
2 3 15
> dtv<-c(2,3,15)
> # let us apply multinom_test() after loading the rstatix package
> library(rstatix)
> multinom_test(dtv)
# A tibble: 1 x 2
p p.signif
* <dbl><chr>
1 0.000919 ***
> # NH: Proportions of pass, fail and detained are equal.
> # AH: Proportions of pass, fail and detained are not equal.
> # From the above result, it is evident that the study rejects the NH.
> # This means the proportions of students passed, failed and detained are not
equal.
> # To know which category has the major proportion, proceed to the post-hoc test,
i.e. pairwise_binom_test()
> # use same package rstatix
> pairwise_binom_test(dtv)
# A tibble: 3 x 9
group1 group2 n estimate conf.low conf.high p p.adj p.adj.signif
* <chr><chr><dbl><dbl><dbl><dbl><dbl><dbl><chr>
1 grp1 grp2 5 0.4 0.0527 0.853 1 1 ns
2 grp1 grp3 17 0.118 0.0146 0.364 0.00235 0.00705 **
3 grp2 grp3 18 0.167 0.0358 0.414 0.00754 0.0151 *
> # From the above result, the proportions of detained (grp1) and fail (grp2) are not
significantly different.
> # Second, the proportions of grp1 vs grp3 and grp2 vs grp3 are
significantly different.
> # Finally, from the frequency table it is evident that the pass proportion is
significantly higher than that of the other groups.
In this context, we have two variables, each with two internal categories,
like gender (2 categories: male and female) and section (secA and secB).
Usually, the chi-square test is used to address this context. It has four major applications.
They are as follows;
I. Can be used for Goodness-of-fit
II. Can be used for Testing the Independence or Association between two categorical
variables (suitable for 2.3.3 context).
III. Can be used for Testing the Homogeneity
IV. Assessing the population variance from sample variance(Beyond the scope).
Let us have simple examples for the above applications of chi-square test.
I. Goodness-of-fit
This concept tests whether the collected data follows a particular
distribution. Here, the distribution of 180 students across three sections is
tested against a uniform distribution. The observed data is
70, 30, 80; the test checks whether it follows a uniform distribution
(i.e., 60, 60, 60).
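The call behind the first output below can be sketched as follows (the vector name strength is taken from the output header; the second output, X-squared = 3.3333, evidently comes from a different, less skewed vector, perhaps something like c(70,50,60)):

```r
# Goodness-of-fit: do 180 students split uniformly (60,60,60) across sections?
# NH: the data follow a uniform distribution; AH: they do not.
strength <- c(70, 30, 80)
chisq.test(strength)   # equal expected probabilities by default
# X-squared = 23.333, df = 2, p-value = 8.575e-06 -> reject NH: not uniform
```

For this first dataset the tiny p-value rejects uniformity; the interpretation that follows in the text applies to the second dataset.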
data: strength
X-squared = 23.333, df = 2, p-value = 8.575e-06
data: strength
X-squared = 3.3333, df = 2, p-value = 0.1889
> # As the p-value is 0.1889 > 0.05, the result fails to reject the NH, meaning the data
follows a uniform distribution.
data: m
X-squared = 3.0791, df = 1, p-value = 0.07931
> # As the p-value is 0.07931 > 0.05, this fails to reject the NH: there is no association
between Gender and Sections.
> info<-c(50,20,20,60)
> m<-matrix(info,2,dimnames=list(c('male','female'),c('secA','secB')))
>m
secA secB
male 50 20
female 20 60
> # NH: No Association between Gender and section
> # AH: An Association between Gender and section
> chisq.test(m)
data: m
X-squared = 30.496, df = 1, p-value = 3.346e-08
> # As the p-value is 0.00000003346 < 0.05, we reject the NH: there is an association
between Gender and Sections.
> # Here, as the expected cell frequencies are well above 5, the continuity correction is
not needed.
> # You can disable it by setting correct = F
> chisq.test(m,correct=F)
Pearson's Chi-squared test
data: m
X-squared = 32.334, df = 1, p-value = 1.298e-08
> # Observe that the p-value decreased from 0.00000003346 to 0.00000001298: Yates'
continuity correction is conservative, so removing it enlarges the test statistic.
> # The numbers have changed, but the outcome is the same: there is an association
between gender and sections.
data: m
X-squared = 30.496, df = 1, p-value = 3.346e-08
> # As the p-value is significant, the distribution of male and female between
secA and secB is not the same.
> # Moreover, males are concentrated in secA and females in secB.
This scenario covers two applications, namely testing of independence (or association)
and testing of homogeneity. Here, one of the two variables has more than 2 internal
categories.
I. Testing of Independence or Association
Extending the previous example, let us take the gender-wise division across the 3
sections. Here, the objective is to verify the association between gender and section.
> info<-c(40,30,20,10,40,40)
> # make a 2*3 matrix with info
> m<-matrix(info,2,dimnames=list(c('male','female'),c('secA','secB','secC')))
>m
secA secB secC
male 40 20 40
female 30 10 40
> # Input for chi-square test
> # NH: There is no association between Gender and Sections of Students.
> # AH: There is an association between Gender and Sections of Students.
> chisq.test(m)
data: m
X-squared = 2.5714, df = 2, p-value = 0.2765
> # As the p-value is 0.2765 > 0.05, it fails to reject the NH.
> # This means there is no association between Gender and Sections of students.
>
> # Let us change the info to c(50,20,15,15,20,60)
> info<-c(50,20,15,15,20,60)
> m<-matrix(info,2,dimnames=list(c('male','female'),c('secA','secB','secC')))
>m
secA secB secC
male 50 15 20
female 20 15 60
> # Let us apply chi-square test
> chisq.test(m)
data: m
X-squared = 32.402, df = 2, p-value = 9.206e-08
Note: There is a case where chi-square is not applicable. If up to 20% of the cells have
an expected frequency of less than 5, you can still apply the chi-square test; but when
more than 20% of the cells have an expected frequency of less than 5, proceed to
Fisher's exact test using fisher.test() in R.
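A minimal sketch of Fisher's exact test on a hypothetical table whose cell counts are too small for chi-square:

```r
# Hypothetical 2x2 table where most cells are below 5,
# so chi-square is unreliable and Fisher's exact test applies
m_small <- matrix(c(3, 1, 1, 3), nrow = 2,
                  dimnames = list(c("male", "female"), c("secA", "secB")))
fisher.test(m_small)   # exact p-value, no large-sample approximation
```

For this toy table the two-sided p-value works out to 34/70 ≈ 0.486, so independence cannot be rejected; the point of the exact test is that this p-value needs no minimum cell-count assumption.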
data: m
X-squared = 32.402, df = 2, p-value = 9.206e-08
> # As the p-value is 0.00000009206 < 0.05, we reject the NH and accept the AH.
> # This indicates that the distribution of male and female is not the same, i.e.,
there is no homogeneity.
> R1
[1] "pass" "pass" "fail" "fail" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass" "pass" "pass" "pass"
> R2
[1] "fail" "fail" "fail" "pass" "fail" "fail" "fail" "pass" "fail" "fail" "fail" "fail" "pass" "fail" "fail"
> dt<-table(R1,R2)
> dt
R2
R1 fail pass
fail 1 2
pass 11 1
> mcnemar.test(dt)
data: dt
McNemar's chi-squared = 4.9231, df = 1, p-value = 0.0265
> # As the p-value is 0.0265 < 0.05, the result rejects the null hypothesis.
> # There is an association between R1 and R2; in other words, R1 and R2 are not
independent.
> # Association here in the sense that students who performed well in R1 could not
perform well in R2.
> # Exploring the reason will be helpful for student betterment.
Example: Continuing the earlier example, let us add one more round, R3. Now, the analysis
demands that the researcher analyze the effect of the type of round on the students'
performance. In other words, the analysis checks whether the performance of students in
all 3 rounds is the same or different.
> R1
[1] "pass" "pass" "fail" "fail" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass" "pass" "pass" "pass"
> R2
[1] "fail" "fail" "fail" "pass" "fail" "fail" "fail" "pass" "fail" "fail" "fail" "fail" "pass" "fail" "fail"
> R3
[1] "pass" "pass" "pass" "pass" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass" "pass" "pass" "fail"
> Outcome<-c(R1,R2,R3)
> Outcome
[1] "pass" "pass" "fail" "fail" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass" "pass" "pass" "pass"
[16] "fail" "fail" "fail" "pass" "fail" "fail" "fail" "pass" "fail" "fail" "fail" "fail" "pass" "fail" "fail"
[31] "pass" "pass" "pass" "pass" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass" "pass" "pass" "fail"
> treatment<-c(rep(1,15),rep(2,15),rep(3,15))
> treatment
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
> participant<-c(1:15,1:15,1:15)
> participant
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 1 2
3 4 5 6
[37] 7 8 9 10 11 12 13 14 15
> outcome<-Outcome
> dt<-data.frame(outcome,treatment,participant)
> str(dt)
'data.frame': 45 obs. of 3 variables:
$ outcome : chr "pass""pass""fail""fail" ...
$ treatment : num 1 1 1 1 1 1 1 1 1 1 ...
$ participant: int 1 2 3 4 5 6 7 8 9 10 …
> # NH: Performance of Students is same in all the 3 rounds of Interview.
> # AH: Performance of Students is not same in all the 3 rounds of Interview.
>library(rstatix)
> cochran_qtest(dt,outcome~treatment|participant)
# A tibble: 1 x 6
.y. n statistic df p method
* <chr><int><dbl><dbl><dbl><chr>
1 outcome 15 13 2 0.00150 Cochran's Q test
> # As per the p-value, the test rejects the NH, meaning the performance of students
differs across the 3 rounds. To know where the difference lies, proceed to the post-hoc
test, i.e., pairwise_mcnemar_test() in the rstatix package.
> pairwise_mcnemar_test(dt,outcome~treatment|participant)
# A tibble: 3 x 6
group1 group2 p p.adj p.adj.signif method
* <chr> <chr> <dbl> <dbl> <chr> <chr>
1 1 2 0.0265 0.0795 ns McNemar test
2 1 3 1 1 ns McNemar test
3 2 3 0.00937 0.0281 * McNemar test
2.3.6 Nominal test for measuring the relationship between two variables
In this aspect, there comes a scenario where the researcher would like to know the
relationship between two nominal variables; then proceed to the phi coefficient of
correlation (its magnitude varies from 0 to 1).
Example:
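As the example itself is not included here, the sketch below computes phi directly from the cell counts of a hypothetical 2×2 table (packages such as psych also provide a phi() helper):

```r
# Phi coefficient for two nominal variables, each with 2 categories.
# For a 2x2 table: phi = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d))
m <- matrix(c(30, 10, 15, 25), nrow = 2,
            dimnames = list(c("male", "female"), c("pass", "fail")))
a <- m[1, 1]; b <- m[1, 2]; c2 <- m[2, 1]; d <- m[2, 2]
phi <- (a * d - b * c2) / sqrt((a + b) * (c2 + d) * (a + c2) * (b + d))
phi   # about 0.378: a moderate association between the two variables
```

For this hypothetical table phi ≈ 0.378, indicating a moderate relationship between gender and pass/fail outcome.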
In this module, the learner learns about four measures of descriptive statistics, namely
measures of central tendency (or location), measures of dispersion (spread or variability),
and measures of skewness and kurtosis (shape).
It discussed the usage of nominal tests: on one categorical variable with two categories,
i.e., the binomial test, and on one categorical variable with more than 2 categories, i.e.,
the multinomial test.
Lastly, the phi coefficient of correlation is used to measure the relationship between two
nominal variables.
Terminal Questions
Answer Keys :
1 2 3 4 5 6 7 8 9 10
C A B B B D C C B D
Activity
Create a data set with three rounds of interview and perform McNemar's as well as
Cochran's Q test.
Use the mtcars data set to perform the chi-square test of association.
Glossary
Bibliography
Mehmetoglu, M., & Mittner, M. (2021). Applied Statistics Using R: A Guide for the Social Sciences. Sage
Publications.
Morris, S., & Sheedy, M. (2022). General Mathematics. John Wiley & Sons.
e-References
https://www.simplilearn.com/types-of-business-analytics-tools-examples-jobs-article
https://www.statmethods.net/r-tutorial/index.html
https://www.datacamp.com/tutorial/data-types-in-r
https://www.w3schools.com/r/default.asp
Video links
https://www.youtube.com/watch?v=WbKiJe5OkUU&list=PLFW6lRTa1g83jjpIOte7RuEYCwOJa-6Gz
https://www.youtube.com/watch?v=niJvj7116Kk
Image Credits : NA
Keywords
mean, median, mode, standard deviation, variance, binomial, multinomial, chi-square,
mcnemar, cochranQ, phi coefficient of correlation.
Ordinal and Scale Tests
MODULE – III
Module Description
In this module, learning starts with an understanding of ordinal and scale data, followed
by their tests. Initially, the data is tested for normality, and based on this result, the
selection of parametric or non-parametric tests is done. Then, the chosen test is used for
analyzing the objective behind the collection of that particular data.
Aim
To understand the application of ordinal and scale test when the data is normal or not
normal.
Instructional Objectives
This module includes:
Application of Ordinal tests
Application of Scale tests
Learning Outcomes
Able to apply ordinal tests when the data is not normal or is ordinal
Able to apply scale tests when the data is normal
Able to identify which test to use for which type of data.
Unit 3.1 Ordinal Tests
Generally, data collected from a survey or any other source will be either categorical or
continuous in nature. If the data is ordered categorical, ordinal tests are applied
directly. Whenever the data does not meet the assumption of normality, these tests are
applied to continuous variables too. So, ordinal tests serve as the equivalent
non-parametric tests (Kloke & McKean, 2015) for parametric tests when the data is not normal.
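The normality check that produces output like the one below follows this pattern (the original Profits data is not shown, so this vector is hypothetical):

```r
# Typical decision flow: test normality first, then pick the test family.
# NH of shapiro.test: the data are normally distributed.
Profits <- c(12, 8, 15, 4, 22, 9, 3, 18, 7, 25)   # hypothetical values
shapiro.test(Profits)
# If p < 0.05, reject normality and use a non-parametric (ordinal) test,
# e.g. wilcox.test(); otherwise a parametric test such as t.test() applies.
```
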
data: Profits
W = 0.91617, p-value = 0.04809
> # As p-value is 0.04809 < 0.05, reject NH that the data is normal.
> # Proceed for non-parametric test
> # Equivalent non-parametric test is Wilcoxon signed rank test
> # syntax: wilcox.test(vector,mu,alternative)
> # NH: Company's profits is not satisfactory (i.e., mu(profits) <=10)
> # AH: Company's profits is satisfactory (i.e., mu(profits) >10)
> wilcox.test(Profits,mu=10,alternative='greater')
data: Profits
V = 116, p-value = 0.7541
alternative hypothesis: true location is greater than 10
Warning messages:
1: In wilcox.test.default(Profits, mu = 10, alternative = "greater") :
cannot compute exact p-value with ties
2: In wilcox.test.default(Profits, mu = 10, alternative = "greater") :
cannot compute exact p-value with zeroes
> # From the above result, the p-value is 0.07553 > 0.05, which fails to reject the NH.
> # This indicates that Region has no effect on Profits.
> # To know Profits region-wise, use the aggregate function
> aggregate(Profits,by=list(Region),median)
Group.1 x
1 Guntur 12.0
2 Vijayawada 4.5
> aggregate(Profits,by=list(Region),mean)
Group.1 x
1 Guntur 11.25
2 Vijayawada 7.25
Looking at the above mean and median values, there is an apparent difference in their sales
or profits, but it is not statistically significant, hence the results obtained above.
Note: As per the warning, if there are many ties in the data or ranks, the Jonckheere-Terpstra
test is applicable. It is available in the DescTools and PMCMRplus packages in R, to name a few.
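A sketch of the DescTools version on hypothetical, ordered groups (the package must be installed; the data below are illustrative, not the chapter's):

```r
# Sketch: Jonckheere-Terpstra test for an ordered alternative
# (assumes install.packages("DescTools") has been run)
library(DescTools)

profits <- c(3, 5, 4, 8, 9, 7, 12, 14, 11)        # hypothetical profits
region  <- ordered(rep(c("low", "mid", "high"), each = 3),
                   levels = c("low", "mid", "high"))
JonckheereTerpstraTest(profits, region)
```

Unlike Kruskal-Wallis, this test exploits the ordering of the groups, which is why it suits ranked data with many ties.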
From the above result, it is evident that there is no significant difference between the
mean or median scores of BrandA and BrandB. But in the cases of BrandA vs BrandC and
BrandB vs BrandC, the differences are statistically significant at the 5% level of
significance.
Further, to know which brand did well, use aggregate to get the numbers for comparison, as
follows;
> aggregate(Profits,by=list(Brands),median)
Group.1 x
1 BrandA 11.5
2 BrandB 12.5
3 BrandC 4.0
Finally, it is evident that BrandB performed better than the other two brands, namely
BrandA and BrandC.
data: BeforeT
W = 0.81791, p-value = 0.02391
> shapiro.test(AfterT)
data: AfterT
W = 0.93388, p-value = 0.4872
Purpose: To analyze repeated responses of the same subject or respondent. In this context,
the study analyzes the performance of students in three tests.
Package: 'stats' for the Friedman test, 'PMCMRplus' for its post-hoc tests.
Function: friedman.test()
Example:
> # Friedman Test
> dt<-read.csv(file.choose()) # Save the dataset in csv and import it.
> str(dt)
'data.frame': 30 obs. of 3 variables:
$ Outcome : int 10 10 14 16 13 12 10 10 10 12 ...
$ Treatment : int 1 1 1 1 1 1 1 1 1 1 ...
$ Participant: int 1 2 3 4 5 6 7 8 9 10 ...
> # Check whether this data is normal or not.
> library(rstatix)
> dt %>% group_by(Treatment) %>% shapiro_test(Outcome)
# A tibble: 3 × 4
Treatment variable statistic p
<int> <chr> <dbl> <dbl>
1 1 Outcome 0.818 0.0239
2 2 Outcome 0.934 0.487
3 3 Outcome 0.832 0.0352
> # As the p-values of Treatments 1 and 3 indicate non-normality, proceed to a
non-parametric test.
> # The context of repeated responses recommends the Friedman test as an alternative to
repeated measures ANOVA.
> # Apply the Friedman test
> # NH: Performance of students in the three tests is the same.
> # AH: Performance of students in the three tests is not the same.
> friedman.test(Outcome~Treatment|Participant,data=dt)
> frdAllPairsConoverTest(y=Outcome,groups=Treatment,blocks=Participant)
Pairwise comparisons using Conover's all-pairs test for a two-way balanced complete
block design
1 2
2 0.0273 -
3 0.0001 0.2608
From the above result, it is evident that students performed differently in Test 1 vs
Test 2 and in Test 1 vs Test 3 (significant p-values), but equivalently in Test 2 vs
Test 3 (insignificant p-value).
> frdAllPairsSiegelTest(y=Outcome,groups=Treatment,blocks=Participant)
1 2
2 0.02025 -
3 0.00011 0.11752
> frdAllPairsMillerTest(y=Outcome,groups=Treatment,blocks=Participant)
Pairwise comparisons using Miller, Bortz et al. and Wike all-pairs test for a two-way
balanced complete block design
1 2
2 0.03665 -
3 0.00019 0.29376
> frdAllPairsExactTest(y=Outcome,groups=Treatment,blocks=Participant)
Pairwise comparisons using Eisinga, Heskes, Pelzer & Te Grotenhuis all-pairs test with
exact p-values for a two-way balanced complete block design
data: y, groups and blocks
1 2
2 0.016 -
3 2.1e-06 0.148
Overall, from all the above post-hoc tests, the common finding is that students performed
equivalently in Test 2 and Test 3, and differently in Test 1.
To know in which test they performed better, compute the aggregate scores with the median,
as follows;
> aggregate(Outcome,by=list(Treatment),median)
Group.1 x
1 1 11
2 2 17
3 3 19
Finally, it is evident that students performed best in Test 3 and worst in Test 1.
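For readers without the course CSV, the whole workflow above can be reproduced on simulated scores (a sketch; the numbers below are invented and are not the course dataset):

```r
# Friedman test on simulated repeated measures: 10 participants, 3 tests
set.seed(42)                                   # reproducible simulation
scores <- data.frame(
  Outcome     = c(rnorm(10, 12, 2), rnorm(10, 17, 2), rnorm(10, 18, 2)),
  Treatment   = factor(rep(1:3, each = 10)),   # the three tests
  Participant = factor(rep(1:10, times = 3))   # the blocking variable
)
# NH: performance is the same across the three tests
res <- friedman.test(Outcome ~ Treatment | Participant, data = scores)
print(res$p.value)   # far below 0.05 for this effect size, so reject the NH
# Post-hoc (after installing PMCMRplus):
# PMCMRplus::frdAllPairsConoverTest(y = scores$Outcome,
#   groups = scores$Treatment, blocks = scores$Participant)
```

The `Outcome ~ Treatment | Participant` formula tells friedman.test() which variable blocks the repeated responses, exactly as in the course example.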
data: Sales
W = 0.92128, p-value = 0.06236
> shapiro.test(Profits)
data: Profits
W = 0.90336, p-value = 0.02538
From the above results, it is clearly evident that, between Sales and Profits, the Sales
data is normally distributed whereas the Profits data is not. So, proceed with the
non-parametric correlation test, i.e., Spearman's rank correlation, the equivalent of Karl
Pearson's coefficient of correlation.
> # NH: No correlation between Sales and Profits
> # AH: A Correlation between Sales and Profits
> cor(Sales,Profits,method='spearman')
[1] 0.9360426
> cor.test(Sales,Profits,method='spearman')
Warning message:
In cor.test.default(Sales, Profits, method = "spearman") :
Cannot compute exact p-value with ties
The p-value is 0.0000000000184 < 0.05, which rejects the NH and concludes that there is a
statistically significant, strong positive correlation, also supported by the coefficient
value itself (0.9360426).
Note: If there are ties, proceed with Kendall's tau (Sharma, 2018). Kendall's tau has three
versions, namely tau-a, tau-b and tau-c: tau-a is for data without ties; tau-b handles ties
in ordinal and interval data; tau-c handles ties, and missing observations as well, in
ordinal and interval data.
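A minimal sketch of this tie-robust alternative (the ordinal vectors below are invented for illustration):

```r
# Kendall's tau for ranked data containing ties
satisfaction <- c(1, 2, 2, 3, 3, 3, 4, 5)   # ordinal ratings with ties
loyalty      <- c(1, 1, 2, 3, 3, 4, 4, 5)
tau <- cor(satisfaction, loyalty, method = "kendall")
print(tau)   # strong positive monotone association
# cor.test(satisfaction, loyalty, method = "kendall") adds the significance test
```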
These parametric tests are used for continuous variables, in association with categorical
variables, for performing particular tests. As these tests use parameters, they consider
all the data for analysis, which makes the results more reliable. Whenever the data is not
normal, researchers often attempt to make it normal using data transformation. If the data
collected is continuous in nature, it has to meet the normality assumption in order to
apply parametric tests (Davis, 2022); otherwise, non-parametric tests are recommended. Most
of the time, a researcher prefers parametric over non-parametric tests wherever applicable
because of their nature of including all observations in the analysis.
> dt
Sales Region Brands Profits YoE
1 12 Guntur BrandA 8 5
2 14 Guntur BrandA 10 5
3 15 Guntur BrandA 11 6
4 17 Guntur BrandA 12 6
5 18 Guntur BrandA 12 8
6 18 Guntur BrandA 12 8
7 20 Guntur BrandA 13 8
8 13 Guntur BrandA 9 7
9 17 Guntur BrandB 9 5
10 19 Guntur BrandB 15 6
11 16 Guntur BrandB 12 7
12 18 Guntur BrandB 12 6
13 19 Vijayawada BrandB 14 6
14 21 Vijayawada BrandB 16 10
15 14 Vijayawada BrandB 12 6
16 17 Vijayawada BrandB 13 6
17 8 Vijayawada BrandC 7 5
18 9 Vijayawada BrandC 5 5
19 10 Vijayawada BrandC 4 6
20 7 Vijayawada BrandC 4 7
21 8 Vijayawada BrandC 3 7
22 6 Vijayawada BrandC 3 4
23 5 Vijayawada BrandC 2 4
24 7 Vijayawada BrandC 4 3
data: Sales
W = 0.92128, p-value = 0.06236
From the above result, it is evident that the p-value is 0.06236 > 0.05, which fails to
reject the NH, so the data is normal.
Now, proceed with the One Sample t-test in order to evaluate the performance of the 24
stores of AKP company.
> t.test(Sales,mu=15,alternative="greater") # benchmark mean of 15, per the alternative hypothesis below
One Sample t-test
data: Sales
t = -1.3083, df = 23, p-value = 0.8982
alternative hypothesis: true mean is greater than 15
95 percent confidence interval:
11.91999 Inf
sample estimates:
mean of x
13.66667
From the above result, it is evident that the p-value is 0.8982 > 0.05, which fails to
reject the NH, meaning the company AKP is not performing well against the benchmark.
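The computation above can be reproduced with the Sales vector printed earlier in the text (the benchmark mean of 15 is inferred from the reported alternative hypothesis):

```r
# One-sample t-test on the AKP Sales data from the text
Sales <- c(12, 14, 15, 17, 18, 18, 20, 13, 17, 19, 16, 18,
           19, 21, 14, 17, 8, 9, 10, 7, 8, 6, 5, 7)
# NH: mean sales <= 15; AH: mean sales > 15
res <- t.test(Sales, mu = 15, alternative = "greater")
print(res$estimate)   # mean of x: 13.66667
print(res$p.value)    # 0.8982 > 0.05: fail to reject the NH
```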
Purpose: To analyze the impact of one categorical variable on a continuous
variable.
In this context, to analyze the impact of Region (Categorical) on Sales(Continuous) of the
Company.
Package: stats
Function: t.test()
Example: To analyze the performance (in sales in lakhs) of AKP company
Assumptions:
There are two basic Assumptions.
1. Normality - Shapiro Wilks test
2. Homogeneity – Bartlett’s test or Levene’s Test
> # As the p-value is 0.014447 < 0.05, the study rejects the NH, meaning the variances
> # are not equal.
> # Choose a two-sample parametric test for unequal variances - the Welch two-sample
> # t-test (the default in R's t.test())
> # Hypothesis for welch two sample test
> # NH: No impact of Region on Sales
> # AH: An impact of Region on Sales
> # Other way of expressing the same hypothesis is as follows
> # NH: Average sales at Guntur is equal to the Average sales at Vijayawada
> # AH: Average sales at Guntur is not equal to the Average sales at Vijayawada
> t.test(Sales~Region,data=dt) # the default two-sample test is the Welch two-sample t-test
Note: If homogeneity exists, include the argument var.equal=T to obtain classic Two Sample
t-test results.
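The Region comparison can likewise be reproduced from the dataset listed in the text:

```r
# Welch two-sample t-test: impact of Region on Sales (data from the text)
Sales  <- c(12, 14, 15, 17, 18, 18, 20, 13, 17, 19, 16, 18,
            19, 21, 14, 17, 8, 9, 10, 7, 8, 6, 5, 7)
Region <- factor(rep(c("Guntur", "Vijayawada"), each = 12))
print(bartlett.test(Sales ~ Region)$p.value)  # ~0.014 < 0.05: variances unequal
res <- t.test(Sales ~ Region)                 # Welch test is the default
print(res$p.value)                            # < 0.05: Region impacts Sales
```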
3.2.3 One Way ANOVA (More than 2 Sample Groups)
In order to apply this test, it has to meet the following assumptions.
1. Normality by Shapiro-Wilks test
2. Homogeneity by Bartlett’s or Levene’s test
3. Independence of Observations – Assuming the data is collected independently.
Purpose: To analyze the impact of a categorical variable (2 or more categories) on a
continuous variable.
In our context, to analyze the impact of type of Brand(A,B,C) on the Sales of the company.
Package: stats
Function: aov()
Example:
> # One-Way ANOVA
> help(aov)
> head(dt)
Sales Region Brands Profits YoE
1 12 Guntur BrandA 8 5
2 14 Guntur BrandA 10 5
3 15 Guntur BrandA 11 6
4 17 Guntur BrandA 12 6
5 18 Guntur BrandA 12 8
6 18 Guntur BrandA 12 8
> # Impact of Brands on Sales of the company.
> # 1. Normality
> library(rstatix)
># NH: data is normal
># AH: data is not normal
> dt %>% group_by(Brands) %>% shapiro_test(Sales)
# A tibble: 3 × 4
Brands variable statistic p
<fct> <chr> <dbl> <dbl>
1 BrandA Sales 0.954 0.751
2 BrandB Sales 0.981 0.966
3 BrandC Sales 0.983 0.975
> # In the above result, as all the Brands' p-values are > 0.05, each group's data is normal.
> # Proceed for a parametric test i.e., ANOVA.
> # With only one independent variable, it is referred to as One-Way ANOVA.
> # Let us test homogeneity too.
> # 2. Homogeneity
> # NH: Variances are equal (i.e., Homogeneity exists)
> # AH: Variances are not equal (i.e., Homogeneity does not exist)
> bartlett.test(Sales~Brands)
> # If the Bartlett test retains the NH (i.e., homogeneity holds), fit the ANOVA and
> # apply Tukey's HSD post-hoc test to locate the pairwise differences:
> model<-aov(Sales~Brands,data=dt)
> TukeyHSD(model)
$Brands
diff lwr upr p adj
BrandB-BrandA 1.750 -1.064726 4.564726 0.2814838
BrandC-BrandA -8.375 -11.189726 -5.560274 0.0000007
BrandC-BrandB -10.125 -12.939726 -7.310274 0.0000000
> # From the above result, based on diff and p adj, we can conclude which brand is better.
> # For BrandB-BrandA, the diff value is positive, meaning BrandB's mean > BrandA's mean.
> # For BrandC-BrandA, the diff value is negative, meaning BrandC's mean < BrandA's mean.
> # For BrandC-BrandB, the diff value is negative, meaning BrandC's mean < BrandB's mean.
> # In short, BrandB's mean > BrandA's mean > BrandC's mean; BrandB is the better performer.
> # To confirm, compute aggregate
> aggregate(Sales,by=list(Brands),mean)
Group.1 x
1 BrandA 15.875
2 BrandB 17.625
3 BrandC 7.500
Finally, from the above result, it is evident that BrandB is performing better than both
BrandA and BrandC.
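The aov() step can be reconstructed from the dataset listed in the text; the Tukey differences it produces match the table above:

```r
# One-way ANOVA with Tukey's HSD post-hoc: impact of Brands on Sales
Sales  <- c(12, 14, 15, 17, 18, 18, 20, 13,   # BrandA
            17, 19, 16, 18, 19, 21, 14, 17,   # BrandB
            8, 9, 10, 7, 8, 6, 5, 7)          # BrandC
Brands <- factor(rep(c("BrandA", "BrandB", "BrandC"), each = 8))
model <- aov(Sales ~ Brands)
summary(model)                # overall F-test for the Brands effect
tk <- TukeyHSD(model)
print(tk$Brands[, "diff"])    # 1.75, -8.375, -10.125, as reported above
```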
3.2.4 Paired Sample T-test (Paired Responses)
In order to apply this test, paired responses from the respondents are needed.
Purpose: To analyze the paired responses of respondents, i.e., before and after scores,
Test 1 and Test 2 marks, to name a few.
Package: stats
Function: t.test(…, paired=T, …)
Example:
> # Paired sample t-test
> # This test is for those who give paired responses.
> dt<-read.csv(file.choose())
> dt
BeforeT AfterT
1 10 15
2 12 16
3 14 17
4 16 20
5 13 16
6 12 19
7 11 18
8 10 17
9 10 16
10 12 19
> str(dt)
'data.frame': 10 obs. of 2 variables:
$ BeforeT: int 10 12 14 16 13 12 11 10 10 12
$ AfterT : int 15 16 17 20 16 19 18 17 16 19
> # Assumption
> # 1. Testing Normality
> # if normality exists proceed for Paired sample t-test
> # use t.test() from stats package
> attach(dt) # attaching variables to R Environment
> shapiro.test(BeforeT)
data: BeforeT
W = 0.89749, p-value = 0.2056
> shapiro.test(AfterT)
data: AfterT
W = 0.93388, p-value = 0.4872
> # As both p-values are > 0.05, the data is normal; proceed with the paired t-test.
> # NH: There is no difference in the average marks before and after training.
> # AH: There is a difference in the average marks before and after training.
> t.test(BeforeT,AfterT,paired=TRUE)
Paired t-test
> # From the above result, it is evident that, as the p-value is 0.000002043 < 0.05, the
> # NH is rejected.
> # Therefore, the average marks of students before training are less than the average
> # marks after training. It can be concluded that the training is effective.
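The paired test can be verified end to end with the BeforeT/AfterT values printed above:

```r
# Paired-sample t-test on the training data from the text
BeforeT <- c(10, 12, 14, 16, 13, 12, 11, 10, 10, 12)
AfterT  <- c(15, 16, 17, 20, 16, 19, 18, 17, 16, 19)
res <- t.test(BeforeT, AfterT, paired = TRUE)
print(res$p.value)                    # ~2e-06 < 0.05: reject the NH
print(mean(AfterT) - mean(BeforeT))   # average gain of 5.3 marks
```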
$ANOVA
Effect DFn DFd F p p<.05 ges
1 Treatment 2 18 28.547 2.61e-06 * 0.644
$`Sphericity Corrections`
Effect GGe DF[GG] p[GG] p[GG]<.05 HFe DF[HF] p[HF] p[HF]<.05
1 Treatment 0.859 1.72, 15.47 1.11e-05 * 1.043 2.09, 18.77 2.61e-06 *
The above results indicate that there is a significant difference in the performance of
students from Test 1 to Test 2 and from Test 1 to Test 3, but not from Test 2 to Test 3.
> aggregate(dt$Outcome,by=list(dt$Treatment),mean)
Group.1 x
1 1 13.0
2 2 17.3
3 3 18.5
Finally, from the above mean scores, it is evident that students performed equivalently in
Test 2 and Test 3, and better in both than in Test 1. Strictly speaking, students performed
best in Test 3 and worst in Test 1.
The correlation (r) value varies from -1 to +1, where -1 indicates a perfectly negative
correlation, +1 indicates a perfectly positive correlation and 0 indicates no correlation.
The following levels are to be followed in interpreting the results:
If r is between 0 and 0.3, there is a weak positive correlation;
if r is between 0.3 and 0.7, there is a moderate positive correlation;
if r is between 0.7 and 1, there is a strong positive correlation.
These ranges are applicable in the negative direction too.
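These cut-offs can be wrapped in a small helper (the function name and labels are ours, following the thresholds above):

```r
# Map a correlation coefficient to the strength labels described above
cor_strength <- function(r) {
  stopifnot(r >= -1, r <= 1)
  if (r == 0) return("no correlation")
  strength  <- if (abs(r) < 0.3) "weak" else if (abs(r) < 0.7) "moderate" else "strong"
  direction <- if (r > 0) "positive" else "negative"
  paste(strength, direction, "correlation")
}
print(cor_strength(0.62))    # "moderate positive correlation"
print(cor_strength(-0.85))   # "strong negative correlation"
```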
data: Sales
W = 0.92128, p-value = 0.06236
> shapiro.test(YoE)
data: YoE
W = 0.95219, p-value = 0.302
From the above results, it is evident that the data of both variables is normally distributed.
> cor(Sales,YoE,method='pearson')
[1] 0.6242311
> # There is a moderate positive correlation between Sales and YoE.
> # As the above result has not been tested for statistical significance,
> # let us perform the significance test.
> # NH: There is no correlation between Sales and YoE
> # AH: There is a correlation between Sales and YoE
> cor.test(Sales,YoE,method='pearson')
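The coefficient and its test can be reproduced from the Sales and YoE columns listed earlier in the text:

```r
# Pearson correlation between Sales and YoE (data from the text)
Sales <- c(12, 14, 15, 17, 18, 18, 20, 13, 17, 19, 16, 18,
           19, 21, 14, 17, 8, 9, 10, 7, 8, 6, 5, 7)
YoE   <- c(5, 5, 6, 6, 8, 8, 8, 7, 5, 6, 7, 6,
           6, 10, 6, 6, 5, 5, 6, 7, 7, 4, 4, 3)
r <- cor(Sales, YoE, method = "pearson")
print(r)                                   # 0.6242311: a moderate positive correlation
cor.test(Sales, YoE, method = "pearson")   # significance test of r
```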
Summary
The present session helps in understanding the usage of ordinal and scale tests, along with
their respective post-hoc tests, based on normality. When the data is not normal or is
ordinal, ordinal tests are used; when the data is normal, scale tests are applied. As part
of the ordinal tests, the Wilcoxon signed-rank test is used for analyzing one sample group
or paired sample groups; further, the Mann-Whitney U test is for two sample groups, the
Kruskal-Wallis test for more than two sample groups and the Friedman test for repeated
responses. As part of the scale tests, the one-sample t-test is for one sample group, the
two-sample t-test for two sample groups, one-way ANOVA for more than two sample groups, the
paired-sample t-test for paired responses and repeated-measures ANOVA for repeated
responses. Lastly, to find out the direction and strength of the relationship between two
variables, Karl Pearson's coefficient of correlation is used when the variables' data is
normal, and Spearman's rank correlation is used when the data is not normal or is ordinal.
Terminal Questions
Activity
Glossary
Ordinal – data expressed in order or ranking
Scale – data expressed as continuous
Normality – expresses that the data is symmetric on both sides of its centre
Homogeneity – expresses the equality of variances
Post-hoc test – helps in knowing the significant differences across the categories
Bibliography
Davis. (2022). Statistical Testing with R. Vor Publications.
Kloke, J., & McKean, J. W. (2015). Nonparametric Statistical Methods Using R. Taylor & Francis.
Mehmetoglu, M., & Mittner, M. (2021). Applied Statistics Using R: A Guide for Social Sciences. Sage
Publications.
e-References
https://datatab.net/tutorial/friedman-test
https://datatab.net/tutorial/kendalls-tau
Video links
https://www.youtube.com/watch?v=iF8nHwLzlxg
https://www.youtube.com/watch?v=dwGzs1D4nyk
Image Credits - NA
Keywords
MODULE – IV
Data Visualization in R
Module 4
Module Description
In this module, the learning starts with understanding the significance of data
visualization in the present world; it has also become a very challenging area for career
development. Data visualization is the process of presenting data in different visual forms
to understand it better. Usually, data visualization is also referred to as presentation of
data, which is done in two ways, namely Graphical Presentation of Data and Diagrammatic
Presentation of Data. In Graphical Presentation of Data, continuous variables are plotted,
while categorical or discrete variables are plotted in Diagrammatic Presentation of Data.
Under Graphical Presentation of Data, the module covers scatter plots, histograms,
frequency polygons, frequency curves and ogives, whereas under Diagrammatic Presentation of
Data it covers one-dimensional diagrams using barcharts, two-dimensional diagrams using
piecharts, three-dimensional diagrams using cubes and cones, pictograms and cartograms.
The module introduces a new GUI in R, i.e., R Commander (Rcmdr), for producing these plots,
along with the usage of the basic console window.
Aim
To enable the learner to understand different ways of presenting data using the R console and Rcmdr.
Instructional Objectives
This module includes:
Data visualization tools
Showcasing data graphically
Showcasing data diagrammatically
Learning Outcomes
Able to understand the significance of different data visualization tools
Able to apply graphical presentation to data
Able to apply diagrammatic presentation to data
Unit 4.1 Introduction to Data Visualization
In this unit, the basic understanding of data visualization and its types is discussed.
Further, it emphasizes different data visualization tools available in the market. Later,
it also covers four types of graphical presentation of data, namely scatter plot,
histogram, line graphs and boxplots, as well as the barcharts and pie charts of
diagrammatic presentation of data.
The basic purpose of doing so is to make everyone understand your findings. Your boss may
not always be tech-savvy, so even a non-statistician should be able to understand the
visualization. The second major purpose of data visualization is to project a large amount
of data in a smart, compact way.
There are two major types :
1. Graphical presentation of data (Continuous data)
2. Diagrammatic presentation of data (Categorical data)
IV.1.2 Data visualization tools
There are several tools available for Data visualization in the market. Some of the
tools are as follows;
1. Excel
2. Tableau
3. Power BI
4. Qlik Sense
5. Google Data Studio
6. Grafana (open source)
7. Python – matplotlib, seaborn (open source)
8. R – graphics, ggplot2, lattice, plotly, Rcmdr (open source)
IV.1.3 Introduction to R Commander
R Commander (Hutcheson, 2019) is a Graphical User Interface (GUI) in R, created by John
Fox, a statistics professor, that helps in performing data analysis as well as
visualization. It can be loaded using the package named ‘Rcmdr’.
3. Install the Rcmdr package using install.packages("Rcmdr", dependencies=T)
4. Load the package using the command library(Rcmdr)
After successful loading, the outcome looks as follows;
In the above Screenshot 1, you will see a screen with two windows. The upper window
showcases the R Script or R Markdown and the lower one is the output window.
The Rcmdr helps in performing all the statistical analysis using parametric and non-
parametric tests under Statistics tab and all graphs under Graphs tab in the menu bar
options. The following screenshot 2 provides an idea of parametric tests and Screenshot 3
provides an idea of non-parametric tests as shown below.
Screenshot 2: Path to parametric tests
In this type of presentation, the continuous data is presented in several forms. They are,
i. Scatter plot - Graphical presentation of correlation
ii. Histogram – A Continuous Barchart.
iii. Frequency Polygon – Connecting the middle points of histogram with straight
lines
iv. Frequency Curve – Smoothening of the Frequency polygon.
v. Ogives (Cumulative Frequency Curve)
Others like line graphs and box plots are also included here. Currently, Scatter plot,
histogram, line graphs and box plots are discussed here.
IV.2.1 Scatter plot
This plot showcases the relationship between two continuous variables. In general, it is
also referred to as a graphical plot of correlation.
In the present context, w.r.t. our dataset named ‘data127’, two variables are available,
namely ‘Marks’ and ‘Attendance’. Being continuous in nature, correlation is possible.
To view this correlation, let us plot the scatter plot using Rcmdr as follows;
Path: Graphs → Scatter plot…
Screenshot 11: Scatter plot
Now, after selecting ‘Scatter plot’, choose ‘Attendance’ in x-variable and ‘Marks’
under y-variable as shown in the Screenshot 11. After selecting the variables, click
on ‘Apply’ to obtain the plot and click ‘OK’ to close the window to see the plot. The
outcome is shown in Screenshot 12.
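The same kind of plot can also be drawn from the plain R console; the values below are simulated, since the course dataset ‘data127’ is not reproduced here:

```r
# Scatter plot of Marks against Attendance with base R graphics
set.seed(1)                                  # reproducible simulation
Attendance <- round(runif(30, 50, 100))      # attendance percentages
Marks      <- round(0.8 * Attendance + rnorm(30, 0, 5))
plot(Attendance, Marks,
     main = "Marks vs Attendance", xlab = "Attendance (%)", ylab = "Marks",
     pch = 19, col = "steelblue")
abline(lm(Marks ~ Attendance), col = "red")  # add a trend line
```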
In this plot, the reader can understand the distribution of the data and, most importantly,
identify outliers in the variable. Usually, it is referred to as a Box and Whisker plot.
Mainly, it provides five summary values for the data, namely the minimum value, Quartile 1
(Q1), Quartile 2 (Q2, the median), Quartile 3 (Q3) and the maximum value.
These box plots can be drawn either by using Rcmdr or R console too.
Let us draw a box plot for Attendance using Rcmdr, as shown below
Path: Graphs → Boxplot…
Fig: 2 Box plot of Attendance Fig: 3 Boxplot of Marks
From the above two box plots placed under Fig:2 and Fig:3, their distribution seems to be
positively skewed for Attendance and Slightly negatively skewed for Marks. Further, it can be
concluded that there are no outliers as you could not see any circles placing outside the
whiskers.
Instead of Rcmdr, if you perform the same after importing the same dataset to R console,
you will get more outcomes too.
In R console perform the following to get Box plot as shown in Screenshot 22.
Screenshot 24: mtcars wt variable with 2 outliers (circles outside the upper whisker)
It is clearly evident that there are 2 outliers, namely the values 5.345 and 5.424 in the
variable. Once outliers are identified, you either need to remove them or replace them with
the mean score, a kind of imputation.
In the basic R console, use the boxplot(dataframe) syntax to plot multiple boxplots in one
go. Do remember that the variables should be continuous in nature.
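The outliers circled in the plot can also be obtained programmatically; a sketch using the built-in mtcars data:

```r
# Box plot with programmatic outlier detection
boxplot(mtcars$wt, main = "Box plot of car weight (mtcars$wt)",
        ylab = "Weight (1000 lbs)")
out <- boxplot.stats(mtcars$wt)$out   # the points drawn beyond the whiskers
print(out)                            # the heaviest cars are flagged
# boxplot(mtcars[, c("mpg", "wt", "hp")]) draws several boxplots in one go
```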
4.3.1 Barcharts
Barcharts are drawn for categorical or discrete data. Categorical data means, for instance,
the number of males and females in a class, or the number of students in the class coming
from different regions. Discrete data covers counts such as the number of cars passing a
toll gate, the number of patients visiting a clinic, the number of tourists visiting a
place, etc.
These barcharts are of three types. They are as follows;
1. Simple Barchart
2. Sub-divided Barchart (Or Stacked Barchart)
3. Multiple Barchart (Or Grouped Or Side-by-Side Or Parallel)
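Before the Rcmdr walkthrough, all three types can be sketched from the plain R console with barplot() (the Gender/Section counts below are invented for illustration):

```r
# Simple, sub-divided (stacked) and multiple (grouped) barcharts in base R
counts <- table(
  Gender  = rep(c("Female", "Male"), times = c(14, 16)),
  Section = rep(c("A", "B", "C"), times = 10)
)
barplot(colSums(counts), main = "Simple: students per Section")
barplot(counts, legend.text = TRUE,
        main = "Sub-divided: Gender stacked within Section")
barplot(counts, beside = TRUE, legend.text = TRUE,
        main = "Multiple: Gender by Section, side by side")
```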
Let us have them from Rcmdr for the same dataset used for scatter plot.
1. Simple Barchart
Let us have a simple barchart for Gender and Sections, as shown in Screenshot 25.
Path: Graphs → Bar graph…
After clicking on ‘Bar graph’, you will get a window with two tabs, namely the ‘Data’ tab
and the ‘Options’ tab. Select ‘Gender’ in the Data tab and set the axis labels in the
Options tab.
After selecting the variable in the Data tab and adding the x and y labels in the Options
tab, click ‘Apply’ to get the result and click ‘OK’ to close the Bar graph window. The
result is as follows;
The above two possibilities are plotted below in Screenshot 28 and Screenshot 29.
In this chart, the data projected in bars is grouped together based on categorical
variables. Here also, there are two ways of projecting the data in bars. They are,
Before obtaining multiple bar graphs, the default setting needs to be changed from the
divided (or stacked) option to the side-by-side option, as shown below.
Screenshot 32: Changing the stacked option to the side-by-side option in the Options tab
i. Section by Gender
The present chart, named pie-chart, projects a proportion or percentage distribution.
For example, plotting the market share of the Top 5 ERP vendors in 2023.
> # A two dimensional diagram
> # pie-chart 2D
> # Using Market share data of 2023, according to Software Connect.
> ms<-c(24.6,21,15.1,9.4,5.3,24.6)
> lbs<-c("Microsoft - 24.6%","SAP-AG - 21%","Oracle - 15.1%","Sage - 9.4%","Infor - 5.3%",
"Others - 24.6%")
> pie(ms,labels=lbs,clockwise=T,col=rainbow(6),main="Market Share of Top 5 ERP Vendors")
> legend('topright',legend=lbs,fill=rainbow(6))
Screenshot 37: Pie-chart - 2D of ERP vendors with market share
In order to plot a pie-chart in 3D, the plotrix package is used. After installing the
package, pie3D() is used. To get a better outcome, the explode argument is used.
> ms<-c(24.6,21,15.1,9.4,5.3,24.6)
> lbs<-c("Microsoft - 24.6%","SAP-AG - 21%","Oracle -15.1%","Sage - 9.4%","Infor -
5.3%","Others-24.6%")
> # In 3D plot of piechart, no argument of clockwise.
> pie3D(ms,labels=lbs,col=rainbow(6),main="Market Share of Top 5 ERP vendors")
The outcome is as follows;
Fig: 4 Market Share of Top 5 ERP vendors
Fig: 5 Market share of Top 5 ERP vendors with Explode Argument
Summary
The present module discusses data visualization, i.e., presenting data in a visual form, as
an important area not only for statisticians but also for non-statisticians to better
understand the current situation of a firm or a business. As part of data visualization,
two types of presentation of data are covered, namely graphical presentation of data and
diagrammatic presentation of data. Under graphical presentation of data, scatter plots,
histograms, line graphs and box plots are discussed; under diagrammatic presentation of
data, barcharts and their types, along with piecharts in 2D and 3D, are discussed using R
Commander (Rcmdr), a GUI in R. Overall, other types of graphics, namely grid graphics,
lattice graphics and the like, can be used based on the requirement.
Terminal Questions
Answer Keys
1 2 3 4 5 6 7 8 9 10
B D B B C C D A D A
Activity
Create a sample data set to apply graphical presentation of data
Create a sample dataset to apply diagrammatic presentation of data
Glossary
Bibliography
Hutcheson, G. (2019). Data Analysis Using R Commander: An Introduction to R Commander. Sage
Publications.
Mehmetoglu, M., & Mittner, M. (2021). Applied Statistics Using R: A Guide for Social Sciences. Sage
Publications.
e-References
https://www.edureka.co/blog/tutorial-on-importing-data-in-r-commander/
https://www.javatpoint.com/r-data-visualization
https://ladal.edu.au/dviz.html
Video links
https://www.youtube.com/watch?v=MAWY51fI01o
https://www.youtube.com/watch?v=C9_zac1LQ9o
https://www.youtube.com/watch?v=DLbu8HdywAk
Keywords
Visualization, scatter plot, histogram, barchart, piechart, line graph, boxplot