BADM Material

The 'Introduction to Business Analytics' course covers essential concepts and tools in business analytics, divided into four modules: Introduction to Business Analytics, Descriptive and Diagnostic Analytics, Ordinal and Scale Tests, and Data Visualization. It emphasizes the importance of analytics in various business domains and introduces statistical programming using R and R Studio. The course aims to equip students with the skills to analyze data and derive insights using various analytical methods and visualization techniques.

Introduction to Business Analytics

Course Description

This course addresses a pressing industry need: Business Analytics has become
a buzzword. According to the India Industry Report 2021, the business analytics
market is expected to grow by 21.2% by 2026 (Xpert, 2023).

The Introduction to Business Analytics course is divided into four modules.

Module 1

Introduction to Business Analytics


Introduction, Types of Business Analytics – Descriptive, Diagnostic, Predictive, Prescriptive
and Cognitive Analytics, Overview of R and R Studio – Data Structures, Functions,
Statements and Looping in R, Choose your Test for Data Analysis.

Module 2

Descriptive and Diagnostic Analytics


Descriptive Analytics: Introduction, Measures of Central Tendency, Measures of Dispersion,
Measures of Skewness and Measures of Kurtosis
Diagnostic Analytics: Parametric Vs Non-Parametric Tests, Nominal Tests – Binomial Test,
Mc Nemar’s Test, Cochran’s Q test-post-hoc test, Chi-square test, Phi-Coefficient of
Correlation

Module 3

Ordinal and Scale Tests


Ordinal Tests – Wilcoxon Signed Rank Test, Mann-Whitney U Test, Kruskal-Wallis Test,
Friedman Tests and related Post-hoc Tests, Spearman Rank Correlation
Scale Tests – T tests – one Sample, Two Sample, Paired Sample, ANOVA – One way and
Two Way with Post-hoc tests, Repeated Measures ANOVA, Karl Pearson’s Coefficient of
Correlation.

Module 4

Data Visualization
Data Visualization: Types of Presentation of Data – Graphical Presentation – Scatter plot,
Histogram; Diagrammatic Presentation – One Dimensional – Bar Charts – Simple, Sub-divided
and Multiple; Two Dimensional – Pie charts 2D and 3D; Other Charts – Box plots,
Line plots Using R Graphics and R Commander/R Deducer.
Table of Contents
Module 1

Introduction to Business Analytics


Unit 1.1 Overview of Business Analytics
Unit 1.2 Applications of Business Analytics
Unit 1.3 Basics of R and Rstudio
Unit 1.4 Choose Your Test

Module 2

Descriptive and Diagnostic Analytics


Unit 2.1 Descriptive Analytics
Unit 2.2 Diagnostic Analytics
Unit 2.3 Nominal Tests

Module 3

Ordinal and Scale Tests


Unit 3.1 Ordinal Tests
Unit 3.2 Scale Tests

Module 4

Data Visualization in R
Unit 4.1 Introduction to Data Visualization
Unit 4.2 Graphical Presentation of Data
Unit 4.3 Diagrammatic Presentation of Data
MODULE – I

Introduction to Business Analytics


Module 1
Module Description
This module provides a basic understanding of Business Analytics and its
types, namely Descriptive Analytics, Diagnostic Analytics, Predictive
Analytics, Prescriptive Analytics and Cognitive Analytics. In addition,
R and its IDE, R Studio, are introduced, R being one of the fastest-growing
statistical programming languages for data analysts and data scientists.
Aim
To provide the basic understanding on Business Analytics and its applications using R.
Instructional Objectives
This unit intends to provide:
 A better understanding of Business Analytics
 A better understanding of R
 A better understanding of R Studio
 The ability to identify the right test for the data
Learning Outcomes
Upon completion, learners will be able to:
 Define Business Analytics
 Discuss the types of Business Analytics
 Explain the applications of Business Analytics
 Express the features of R
 Distinguish between R and R Studio
 Determine the appropriate test for any data

Unit 1.1 Overview of Business Analytics


Business Analytics being a buzzword of the 21st century, the entire industry is looking into this
area of study. The following questions need to be addressed to understand this subject
better.

Why?
What is Business? A profit-oriented activity.
What is Analytics? A set of statistical tools.
What is Business Analytics? Analyzing business issues using different
statistical tools.

Types of Business Analytics::
-------------------------------------
There are five types::
1. Descriptive Analytics -- What happened?
2. Diagnostic Analytics -- Why did it happen?
3. Predictive Analytics -- What will happen?
4. Prescriptive Analytics -- What to do? (Solution based)
5. Cognitive Analytics -- AI-based solutions - IBM Watson

1. Descriptive Analytics:: Set of analytical tools used for describing
the nature of data.
Statistical Concepts/Tools::
- Descriptive Statistics
- Measures of Central Tendency - locating the centre of the data: Mean,
Median, Mode
In R - DescTools package - Mean(), Median() and Mode()
- Measures of Dispersion
Absolute Measures            Relative Measures
-------------------------------------------------------
1. Range                     Coefficient of Range
2. Mean Deviation (MD)       Coefficient of MD
3. Quartile Deviation (QD)   Coefficient of QD
4. Standard Deviation (SD)   Coefficient of SD (not discussed as a concept)
5. Variance                  Coefficient of Variance

In R: range(), max(), min(), sd(), var()
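These measures can be tried out directly in base R. A minimal sketch with a made-up marks vector; since base R has no built-in statistical mode, a small helper stands in for DescTools' Mode():

```r
# Descriptive measures for a small, made-up marks vector (base R only)
marks <- c(12, 15, 15, 17, 21, 24, 30)

mean(marks)                       # arithmetic mean
median(marks)                     # middle value: 17
sd(marks)                         # standard deviation
var(marks)                       # variance
range(marks)                      # smallest and largest value
max(marks) - min(marks)           # range as a single number: 18
sd(marks) / mean(marks) * 100     # coefficient of variation (%)

# base R has no statistical mode; this helper returns the most frequent value
stat_mode <- function(x) as.numeric(names(which.max(table(x))))
stat_mode(marks)                  # 15
```

The helper simply tabulates frequencies and picks the most common value, which matches what a statistical mode should report for this data.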

- Measures of Skewness (asymmetry of the curve)
- Karl Pearson's Coefficient of Skewness (based on measures of central tendency)
- Bowley's COSk (Quartiles)
- Kelly's COSk (Deciles and Percentiles)

In R, moments package - skewness()
- D'Agostino test - agostino.test()
- Measures of Kurtosis (peakedness of the curve)
In R, moments package - kurtosis()
- Anscombe-Glynn test - anscombe.test()
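The moment-based definitions behind these functions can also be written out in base R, which makes the formulas explicit without needing the moments package (illustrative data):

```r
# Moment-based skewness and kurtosis in base R
# (the same quantities moments::skewness() and moments::kurtosis() report)
x <- c(2, 4, 4, 5, 7, 9, 20)

m  <- mean(x)
s2 <- mean((x - m)^2)              # population variance (divide by n, not n-1)

skew <- mean((x - m)^3) / s2^1.5   # > 0 here: the long tail is to the right
kurt <- mean((x - m)^4) / s2^2     # equals 3 for a perfect normal curve

skew
kurt
```

The outlier 20 pulls the tail to the right, so the skewness comes out positive, exactly the behaviour the concept describes.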
- Data Visualization (Presentation of Data)
- RCommander package, ggplot2 package
- Two types - 1. Graphical presentation of data
2. Diagrammatic presentation of data
- Correlation Analysis
- psych package - pairs.panels()
- Principal Component Analysis (PCA)
- Dimension Reduction Technique
- Default packages stats & base - princomp()
In industry, this goes by another name too: EDA - Exploratory Data Analysis.
Software tools::
-------------------
1. Excel - XLSTAT
2. SYSTAT
3. SAS - JMP
4. Minitab
5. R - >19,000 packages -- stats, base along with graphics
6. Python -- NumPy, Pandas, SciPy
7. Tableau, PowerBI, Spotfire, QlikView
8. SPSS (PSPP)

2. Diagnostic Analytics -- Why did it happen?

Majorly, it deals with Inferential Statistics, which analyses a sample
and draws inferences about the population.
- It takes the help of a Hypothesis (singular) / Hypotheses (plural).
How do we do hypothesis testing?
- It is a process of testing a sample to draw inferences about the
population.
Types of Hypothesis:: There are three types. They are,
1. Research Hypothesis
2. Statistical Hypothesis
3. Substantive Hypothesis
1. Research Hypothesis:: The hypothesis to be tested. In general, it is the
Alternate Hypothesis.
2. Statistical Hypothesis:: The statistical way of stating the hypothesis.
i. Null Hypothesis (H0 or Hnull)
ii. Alternative Hypothesis (H1 or Ha or Halternative)
NH: Marks = 15
AH: Marks ≠ 15 (both > and < in consideration)

NH: Marks <= 15 (Students did not perform well)
AH: Marks > 15 (Students performed well)

NH: Marks >= 15 (Students performed well)
AH: Marks < 15 (Students did not perform well)
3. Substantive Hypothesis:: To understand whether the hypothesis to be
tested has practical relevance, i.e., whether that hypothesis is
practically implementable in reality or not.
Once hypothesis is formed, it is to be tested using a procedure called
Hypothesis testing procedure. The procedure is as follows.
Hypothesis Testing Procedure::
==============================
We use this procedure for analyzing a sample to draw inferences to
the population. The steps are,
1. State the Hypothesis
NH & AH
NH: Students did not perform well (marks <= 15)
AH: Students performed well (marks > 15)
2. Determine the level of significance(Type I Error)- 5%
3. Choose the test
Tests of Hypothesis:: 1. Parametric tests &
2. Non-Parametric tests
1. Parametric tests
Sample size < 30 - Small Sample - T-test
Sample size >=30 - Large Sample - Z test
Data is Normal
2. Non-Parametric tests
Data need not be normal - distribution free tests

Note: When the data is not normal, apply Data Transformation


4. Compute the test (Using Formula - test statistic - t-stat)
5. Determine the tabular value by degrees of freedom( t-tab)
6. Interpret the result.
If t-stat > t-tab - Reject NH
If t-stat < t-tab - Accept NH or
If p-value < 0.05 - Reject NH
If p-value > 0.05 - Accept NH
--------------------------------------------------------------------------------------------------
----------------------
Example: To test the normality of data - (2,4,10,15,1,150)
Step 1: State the Hypothesis
NH: Data is normal
AH: Data is not normal
Step 2: Level of significance (Alpha) - 0.05 (5%)
Step 3: Choose the test
- Use the Shapiro-Wilk test, as the sample size is within 3 to 5000
Step 4: Compute the test
- obtained p-value is 0.0002395
Step 5: Optional
Step 6: Interpret the result
As the p-value 0.0002395 < 0.05, reject NH, meaning the data is not
normal

As the data is not normal, apply data transformation(dt)

After dt, the p-value is 0.7455 > 0.05, accept NH -Data is normal.
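The steps above can be run with the built-in shapiro.test() from base R's stats package. The log transform below is one common transformation for right-skewed data; exact p-values depend on the data, but this data set clearly fails the normality check:

```r
# Steps 3-6: Shapiro-Wilk normality test in base R
x <- c(2, 4, 10, 15, 1, 150)

st <- shapiro.test(x)
st$p.value            # far below 0.05 -> reject NH: data is not normal

# data transformation: a log transform often normalizes right-skewed data
st2 <- shapiro.test(log(x))
st2$p.value           # compare against 0.05 again before accepting NH
```
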
In the entire process of Hypothesis testing, there is also an attempt to
reduce errors in hypothesis. Conventionally, there are two types
namely Type I Error and Type II Error. Later, conceptually extended to
Type III and IV Errors too. They are as follows;
Types of Errors in the Hypothesis
1. Type I Error – Alpha – Level of Significance – Producer's Risk (in
quality control)
2. Type II Error – Beta (1-Beta => Power of the Test) – Consumer's Risk
3. Type III Error – Wrongly projecting the Hypothesis
4. Type IV Error – Wrongly interpreting the results

            Ho is True              | Ho is False
--------------------------------------------------------------
Accept H0:  Correct Decision        | Type II Error (Beta)
Reject H0:  Type I Error (Alpha)    | Correct Decision (1-Beta)
                                    |  - Power of the test
--------------------------------------------------------------
Correct Hypothesis:: NH: Data is Normal
AH: Data is not Normal
dt<-c(10,11,12,15,14,16,14,17)

If p-value < 0.05, Reject NH
If p-value > 0.05, Accept NH
As the p-value is 0.8156 > 0.05, accept NH - the data is normal.

All the tests of hypothesis (PT as well as NPT) are classified based on
Levels of Measurement.
What are the levels of measurement?
- Four levels - Nominal, Ordinal, Interval and Ratio
1. Nominal data - Data meant for identification/classification
Examples - Gender (Male/Female)
- Marital Status (Single/Married)
- Colours
- Names of persons, items, countries...
- Categories under a nominal variable are incomparable.
2. Ordinal - Categories in the variable are comparable
Examples: Performance level - Low, Moderate, High
Education Qual: 10th, Inter, UG, PG, Ph.D, PDF
Likert Item/Scale: Agree-Disagree scale
Important-Unimportant scale
Favourable-Unfavourable scale
Item scores (SDA to SA):              1   2   3   4   5
Example: He is hardworking            SDA DA  N   A   SA
He is a smart worker                  SDA DA  N   A   SA
He is intelligent                     SDA DA  N   A   SA
He is a good team member              SDA DA  N   A   SA
Scale totals (all four items alike):  4   8   12  16  20
3. Interval - Distance between the categories is the same.
- Examples - Likert scale, temperature
- No absolute zero
4. Ratio - With absolute zero, meaning zero means zero only.
Example: Height, Marks, Sales, Weight...
Summary::
LOM\Features:   Ident./Classif.   Order/Rank   EqualDist   AbsZero
-------------------------------------------------------------------------------------
1. Nominal           yes              no           no         no
2. Ordinal           yes              yes          no         no
3. Interval          yes              yes          yes        no
4. Ratio             yes              yes          yes        yes
-------------------------------------------------------------------------------------
Nominal - Categorical variable
Ordinal - Ordered categorical variable
Interval & Ratio - Scale data - Continuous data
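In R, the first two levels map naturally onto (ordered) factors, while interval and ratio data are plain numeric vectors. A minimal sketch with made-up values:

```r
# Levels of measurement expressed as R objects (illustrative values)
gender <- factor(c("m", "f", "m"))                 # nominal: categories only
perf   <- factor(c("Low", "High", "Moderate"),
                 levels  = c("Low", "Moderate", "High"),
                 ordered = TRUE)                   # ordinal: categories with order
temp_c <- c(21.5, 23.0, 19.8)                      # interval: no absolute zero
height <- c(160, 172, 181)                         # ratio: absolute zero exists

is.ordered(gender)     # FALSE - nominal categories are incomparable
is.ordered(perf)       # TRUE
perf[1] < perf[2]      # TRUE - "Low" < "High" is a meaningful comparison
```

Note that comparing levels of a plain (unordered) factor with `<` is meaningless, which mirrors the rule that nominal categories are incomparable.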
2. Diagnostic Analytics:: This type of analytics helps in diagnosing the
data to explore the reasons behind what happened. Techniques used include:
1. Descriptive Statistics
2. Inferential Statistics(A part)
3. Correlation & Regression Analysis
4. Time-Series Analysis
5. Cluster Analysis
6. Decision Tree modelling
7. Root-Cause Analysis
Tools:: SPSS, PSPP, R, Python, Minitab, Tableau, PowerBI, SYSTAT, SAS
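As a small taste of diagnostics (the nominal tests themselves are covered in Module 2), base R's chisq.test() can check whether two categorical variables are associated. The counts below are made up for illustration:

```r
# Chi-square test of association on an illustrative 2x2 table of counts
tab <- matrix(c(30, 10,
                20, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(Gender  = c("Male", "Female"),
                              Prefers = c("Brand A", "Brand B")))

res <- chisq.test(tab)   # Yates' continuity correction is applied for 2x2 tables
res$p.value              # < 0.05 -> reject NH of independence: an association exists
```
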
3. Predictive Analytics:: Tools used to predict or forecast the future.
* Three purposes: Estimation, Prediction and Forecasting
1. Estimation: point estimate & interval estimate
2. Prediction: linear models (lm)
general linear models (glm)
generalized linear models (glm)
generalized linear mixed models (glmm)
3. Forecasting: Moving Averages,
Exponential Smoothing
ARIMA - AutoRegressive Integrated Moving Average models
VAR - Vector AutoRegressive models
ARCH - AutoRegressive Conditional Heteroscedasticity
GARCH - Generalized ARCH
* To forecast the future at the present point in time
* For estimating sales --> profits
* Spread of diseases
* Weather forecasting
* Predicting share prices
* Developing predictive models of defaulters for banks

Statistical Methods:: Regression Analysis, Discriminant model,
Logistic Regression,
Decision Tree Analysis,
Random Forest.....

Statistical Software:: Excel, R, Python, Minitab, SYSTAT, SAS
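A minimal prediction example with base R's lm(), using made-up advertising and sales figures (a sketch, not one of the course's data sets):

```r
# Predicting sales from advertising spend with a simple linear model (lm)
ads   <- c(1, 2, 3, 4, 5)       # advertising spend (made-up figures)
sales <- c(12, 15, 19, 24, 27)  # observed sales

fit <- lm(sales ~ ads)          # fits sales = a + b * ads
coef(fit)                       # intercept 7.7, slope 3.9

predict(fit, data.frame(ads = 6))   # forecast at a spend of 6: 31.1
```

The fitted slope says each extra unit of ad spend is associated with about 3.9 extra units of sales, and predict() extrapolates that line to new spend levels.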

4. Prescriptive Analytics:: Provides optimum solutions for problems.
- Covers Operations Research
- Linear Programming Problems
- Integer Programming
- Goal Programming
- Mixed Integer Programming - Manufacturing
- Decision Tree Analysis
- Game Theory ------------ Finance
- Sequencing Problems --- Manufacturing
- Transportation Problems
- Assignment Problems
- Network Analysis - CPM/PERT/GERT
- Queuing Theory - waiting time & service time
- Simulation - Electrical and Electronics
- Quality Control Charts

Tools: Excel --- Excel Solver
LINDO Systems / LINGO
AIMMS
IBM ILOG CPLEX
R - lpSolve, lpSolveAPI, Rsymphony, Rglpk, quadprog, Gurobi
Python - MIP, SciPy optimize, scikit-optimize, Gurobi, PuLP

5. Cognitive Analytics:: Provides optimum solutions by using AI.
- IBM Watson (even a 13-14-year-old boy has written code for it)
- Amazon's Alexa
- Google Assistant
- Apple's Siri
- AlphaGo / AlphaZero (which defeated Stockfish 8)

Unit 1.2 Applications of Business Analytics

This section deals with the applications of Business Analytics in various domains, namely
Marketing, Finance, Human Resources and Operations. They are as follows;
1. Marketing (CCDV):- Describing demographics
Customer Lifetime Value (CLTV)
Cluster Analysis - Market Segmentation
Conjoint Analysis - New Product Development
(Control(2), Weight(3), Price(3), Colour(4), PC(2))
Market Basket Analysis - Sales, Cross-selling
Demand / Sales Forecasting - Trend Analysis
2. Finance:- Share price movements - ARIMA/ARCH/GARCH
- ARCH - AutoRegressive Conditional Heteroscedasticity
- Volatility - Engle (Nobel laureate)
- GARCH - Generalized ARCH models - Bollerslev
- SGARCH, TGARCH (Threshold), MGARCH, APGARCH (Asymmetric Power),
GJR-GARCH, FGARCH...
- For prediction and for risk assessment
- Standard Deviation for choosing the better of projects A & B
3. Human Resources:- Employee Attrition; Employee Performance;
Employee Empowerment; Emotional Intelligence;
Performance Appraisal
4. Operations:- Raw material procurement
Logistics - Transportation cost
Linear Programming Problem (LPP) - for maximum profits of
Mugs (x1) & Jugs (x2)
Max. Z = 3x1 + 4x2 -- Objective function
Subject to constraints:
2x1 + 5x2 <= 40 -- Raw material constraint
3x1 + 3x2 <= 50 -- Labour hour constraint
where x1, x2 >= 0
Quality Control Charts - to measure product quality
- QCC for variables
- QCC for attributes
- qcc package in R
Job Sequencing Problems
Assignment Problems
BA is playing a vital role in all the domains as well as in all industries
like Healthcare, FMCG, Agriculture, IT, Logistics, Textile and
Retail, to name a few.

Unit 1.3 Basics of R and Rstudio

This unit talks about the basics of R and R studio in detail.


Fig:1 R and R Studio
1.3.1 Basic Understanding of R
What is R?
- It is a statistical programming language created by Ross Ihaka and Robert
Gentleman in 1995 at the University of Auckland, New Zealand. R is a
descendant of the S language developed by John Chambers and his
colleagues at Bell Laboratories in 1976.
- It is Open Source – the source code is freely available to modify and
share with others.
- It is freely downloadable and useful.
Why do we need R?
- To handle Big-Data (Volume, Variety, Velocity, Veracity)
- To handle wide-range of Applications in almost all industries.
What is there in R?
- Very rich in packages ~ >19,000 (as on July 2, 2023, there are 19,789 packages)
- Very rich in graphics - graphics, ggplot2, Deducer, GrapheR
- Can be used for Machine Learning too.
What can we do with R?
- All types of Analytics
1.3.2 Data types in R
There are six major data types in R. They are as follows;
- numeric, integer, complex, logical, raw, character
Exercises::
> #DATA TYPES in R
> # numeric
> a=34 # Scalar – A single-element vector
> a
[1] 34
> a<-c(34,44,54) # Vector – Set of elements with numbers – Numeric vector
> a
[1] 34 44 54
> b<-c('A','B','C') # Vector with alphabets or characters or strings – Character vector
> b
[1] "A" "B" "C"
> # typeof(), mode() and class() identify the data type, storage mode and the class to which a particular object belongs, respectively.
> typeof(a)
[1] "double"
> mode(a)
[1] "numeric"
> class(a)
[1] "numeric"
> # Integer
> # represents an integer with the suffix L
> i<-34L
> typeof(i)
[1] "integer"
> mode(i)
[1] "numeric"
> class(i)
[1] "integer"
> # complex
> cx<-2+4i
> typeof(cx)
[1] "complex"
> mode(cx)
[1] "complex"
> class(cx)
[1] "complex"
> # logical
> a=TRUE
> typeof(a)
[1] "logical"
> mode(a)
[1] "logical"
> class(a)
[1] "logical"
> # Data types in R – Few more exercises
> # numeric
> a=54
>a
[1] 54
> typeof(a)
[1] "double"
> mode(a)
[1] "numeric"
> class(a)
[1] "numeric"
> a<-c(54,34,44)
> typeof(a)
[1] "double"
> mode(a)
[1] "numeric"
> class(a)
[1] "numeric"
> # integer
> a<-23L
> typeof(a)
[1] "integer"
> mode(a)
[1] "numeric"
> class(a)
[1] "integer"
> #complex
> a<-12+4i
>a
[1] 12+4i
> typeof(a)
[1] "complex"
> mode(a)
[1] "complex"
> class(a)
[1] "complex"
> # logical
> a<-TRUE
> typeof(a)
[1] "logical"
> mode(a)
[1] "logical"
> class(a)
[1] "logical"
> # character
> a<-"Srikanth is a Good boy"
>a
[1] "Srikanth is a Good boy"
> typeof(a)
[1] "character"
> mode(a)
[1] "character"
> class(a)
[1] "character"
> # raw – It represents sequence of bytes in hexadecimal code.
> # charToRaw()
> a<-'Srikanth'
>a
[1] "Srikanth"
> charToRaw(a)
[1] 53 72 69 6b 61 6e 74 68
> # Order of precedence – This is to know which data type is of higher order.
> a<-c(10,13,12L)
>a
[1] 10 13 12
> typeof(a)
[1] "double"
> mode(a)
[1] "numeric"
> class(a)
[1] "numeric"
> # numeric>integer
> a<-c(10,13,12L,"pooja")
> typeof(a)
[1] "character"
>a
[1] "10" "13" "12" "pooja"
> mode(a)
[1] "character"
> class(a)
[1] "character"
> # Conclusion
> # character > numeric > integer – from the above exercise.
1.3.3 Data operators in R
There are four major data operators in R. They are as follows;
1. Assignment Operators: <-,<<-,=,->,->>
2. Arithmetic Operators: +,-,*,/,%%,^,**
3. Relational or Comparative Operators: ==,!=,>=,<=,>,<
4. Logical Operators: && , & (AND); || , | (OR)
Others: combining operator c() & colon operator (:)
> # Assignment Operators : <-,<<-,=,->,->>
> a<-12
> b<<-22
> d=32
> 42->e
> 52->>f
> print(paste("The value of a for operator <- is",a))
[1] "The value of a for operator <- is 12"
> print(paste("The value of b for operator <<- is",b))
[1] "The value of b for operator <<- is 22"
> print(paste("The value of d for operator = is",d))
[1] "The value of d for operator = is 32"
> print(paste("The value of e for operator -> is",e))
[1] "The value of e for operator -> is 42"
Note: Typing just the variable name also prints its value.
> # 2. Arithmetic Operators
> a=12
> b=4
> print(paste("The addition of a and b using + operator is",a+b))
[1] "The addition of a and b using + operator is 16"
> print(paste("The subtraction of a and b using - operator is",a-b))
[1] "The subtraction of a and b using - operator is 8"
> print(paste("The division of a and b using / operator is",a/b))
[1] "The division of a and b using / operator is 3"
> print(paste("The mod of a and b using %% operator is",a%%b))
[1] "The mod of a and b using %% operator is 0"
> print(paste("The product of a and b using * operator is",a*b))
[1] "The product of a and b using * operator is 48"
> print(paste("The power of a to b using ** operator is",a**b))
[1] "The power of a to b using ** operator is 20736"
> # logical Operators
> # &&, ||, &, |
> p=12
> q=21
> p && q # use for scalars rather than vectors
[1] TRUE
> d<-c(12,10,21)
> d
[1] 12 10 21
> f<-c(21,11,31)
> f
[1] 21 11 31
> d & f # element-wise, works on vectors
[1] TRUE TRUE TRUE
> p || q
[1] TRUE
> d |f
[1] TRUE TRUE TRUE
> # 3. Relational Operators
> a=12
> b=4
> print(paste("The outcome of a and b using == operator is",a==b))
[1] "The outcome of a and b using == operator is FALSE"
> print(paste("The outcome of a and b using != operator is",a!=b))
[1] "The outcome of a and b using != operator is TRUE"
> print(paste("The outcome of a and b using >= operator is",a>=b))
[1] "The outcome of a and b using >= operator is TRUE"
> print(paste("The outcome of a and b using <= operator is",a<=b))
[1] "The outcome of a and b using <= operator is FALSE"
> print(paste("The outcome of a and b using > operator is",a>b))
[1] "The outcome of a and b using > operator is TRUE"
> print(paste("The outcome of a and b using < operator is",a<b))
[1] "The outcome of a and b using < operator is FALSE"
> # Combining operator
> # c – combines elements as a vector.
> a=23
>a
[1] 23
> a<-c(23,43,45)
>a
[1] 23 43 45
> # colon operator (:) – produces sequence of elements
> 1:20
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> a<-21:212
> a
[1] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
[18] 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
[35] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
[52] 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
[69] 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105
[86] 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122
[103] 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139
[120] 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156
[137] 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173
[154] 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190
[171] 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
[188] 208 209 210 211 212
> a[18]
[1] 38
> a[105]
[1] 125
> a[192]
[1] 212
1.3.4 Data Structures in R
These are the basic structures in which the data is arranged for further
analysis. There are six major data structures in R. They are as follows;
1. vector & scalar
# vector  One-dimensional data structure with more than one element.
# Scalar  A single element vector
Examples::
> a<-21 # scalar
>a
[1] 21
> b<-c(10,12,14,17,21) # vector
>b
[1] 10 12 14 17 21
Applying arithmetic operations on vector(s) is called
Vectorization.
Examples::
> b<-c(10,12,15)
> d<-c(2,3,5)
> b+d
[1] 12 15 20
2. matrix – A Two dimensional data structure with same data type at a
time.
> # syntax::
matrix(vector, nrow, ncol, byrow, dimnames=list(rownames, colnames))
Example:
> # let us create a matrix for 3 students with variables-
rollno,marks,age
> data<-c(101,102,103,15,17,19,21,20,22)
> m<-matrix(data,nrow=3,ncol=3,byrow=F,dimnames=list(c('r1','r2','r3'),c('rollno','marks','age')))
>m
rollno marks age
r1 101 15 21
r2 102 17 20
r3 103 19 22
3. array – A multidimensional data structure with same data type.
> # syntax:
array(vector,dim=c(r,c,d),dimnames=list(rownames,colnames))
Example:
> data<-c(101,102,103,15,17,19,21,20,22)
> ar<-array(data,dim=c(3,3,1),dimnames=list(c('r1','r2','r3'),c('rollno','marks','age')))
> ar
,,1

rollno marks age


r1 101 15 21
r2 102 17 20
r3 103 19 22
4. list – A one-dimensional data structure that can handle different data
types of different sizes.
Or A one-dimensional data structure that can handle different vectors
of different data types of different sizes.
Example: Create three different vectors with different data types and
sizes too.
> names<-c('Ramu','Raju','Sita')
> marks<-c(12,16)
> age<-c(21,20,22)
> res<-list(names,marks,age)
> res
[[1]]
[1] "Ramu" "Raju" "Sita"
[[2]]
[1] 12 16
[[3]]
[1] 21 20 22

5. data frame: A two-dimensional data structure with different vectors


of different data types of the same size.
> names<-c('Ramu','Raju','Sita')
> marks<-c(12,16,17)
> age<-c(21,20,22)
> df<-data.frame(names,marks,age)
> df
names marks age
1 Ramu 12 21
2 Raju 16 20
3 Sita 17 22
Operations on data frame
> # Operations on data frame includes-
> # 1. Adding a column
> # 2. Removing a column
> # 3. Adding a row
> # 4. Removing a row
> # let us start
> # 1. Adding a column
> # Create another column and use cbind() to add the created one to the existing dataframe
> # Create location vector
> location<-c('Guntur','Tenali','Tenali')
> df<-cbind(df,location)
> df
names marks age location
1 Ramu 12 21 Guntur
2 Raju 16 20 Tenali
3 Sita 17 22 Tenali
> # 2. Removing a column
> # Remove age column from df
> df<-df[-3]
> df
names marks location
1 Ramu 12 Guntur
2 Raju 16 Tenali
3 Sita 17 Tenali
> # 3. Adding a row
> # In order to add a row, create a new data frame with the same columns.
> df1<-data.frame(names='lokesh',marks=18,location='vijayawada')
> df1
names marks location
1 lokesh 18 vijayawada
> # use rbind() to add created row dataframe to existing dataframe
> df<-rbind(df,df1)
> df
names marks location
1 Ramu 12 Guntur
2 Raju 16 Tenali
3 Sita 17 Tenali
4 lokesh 18 vijayawada
> # 4. Removing a row
> # let us remove 2nd row
> df<-df[-2,]
> df
names marks location
1 Ramu 12 Guntur
3 Sita 17 Tenali
4 lokesh 18 vijayawada
> # Extracting the data using Indexing and slicing using old df
> df
names marks age
1 Ramu 12 21
2 Raju 16 20
3 Sita 17 22
> # Indexing is a process of extracting a particular element from the data frame
> # In Indexing, the syntax is dataframe[rows,columns]
> # In Slicing, the syntax is dataframe[sr:er,sc:ec]
> # Further in slicing, dataframe[c(random rows),c(random cols)] can be extracted.
> df[1] # Extracting first column
names
1 Ramu
2 Raju
3 Sita
> df[3] # Extracting third column as a column
age
1 21
2 20
3 22
> df[,3] # Extracting third column as a vector
[1] 21 20 22
> df[1,] # Extracting first row
names marks age
1 Ramu 12 21
> df[3,] # Extracting third row
names marks age
3 Sita 17 22
> # Extracting an element with row and column - 16 marks to be extracted
> df[2,2]
[1] 16
> # Extracting the name sita
> df[3,1]
[1] "Sita"
> # Slice only marks and age of Sita i.e., 17 and 22
> df[3,2:3]
marks age
3 17 22

6. factor – The data structure that represents a categorical variable.


> # It has two functions-is.factor() and as.factor()
> # Gender
> gd<-c('m','m','f','f','m','m')
> gd # by data it is categorical, but not yet for R
[1] "m" "m" "f" "f" "m" "m"
> is.factor(gd) # To check whether gd is a factor or not
[1] FALSE
> as.factor(gd)
[1] m m f f m m
Levels: f m
as.factor() can be applied to any variable that you want treated as
categorical.
1.3.5 Functions in R
Basically, there are two types of functions in R. They are as follows;
1. In-built functions - readily available, written by earlier
researchers/statisticians
Ex: mean(), median(), sd(), length()...
2. User-defined Functions: Your own functions
Syntax: function_name<-function(variable(s)){
Statement(s)
}
Ex: cube(x) # To develop this function perform the
following.
cube<-function(x){
return(x^3)
}
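Calling the user-defined cube() then works like any built-in, and because ^ is vectorized, cube() is vectorized too:

```r
# Defining and using the cube() function from above
cube <- function(x) {
  return(x^3)
}

cube(3)     # 27
cube(1:4)   # vectorized over a whole vector: 1 8 27 64
```
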

1.3.6 Conditional and looping statements in R


1. Conditional Statements: if, if-else, nested if-else and switch
2. Looping statements: for, while, repeat

1. Conditional Statements: if, if-else, nested if-else and switch


These statements help in programming as well as for performing
data analysis too.
i. if statement
ii. if-else statement
iii. nested if-else statement
iv. switch statement

i. if statement – if the condition holds, the statements are executed.
Syntax: if(condition){
Statement(s)
}
Example::
> marks=65
> if(marks>=60){
+ print("Achieved First Class")
+}
[1] "Achieved First Class"

ii. if-else statement – if the condition is satisfied, the if-branch
outcome is generated; otherwise, the statements under else are executed.
Syntax: if(condition){
Statement(s)
} else {
Statement(s)
}
Example::
> marks=59
> if(marks>=60){
+ print("Achieved First Class")
+ } else{
+ print("Sorry")
+}
[1] "Sorry"

iii. nested if-else statement – use this when multiple if-else conditions
have to be chained.
Syntax: if(condition) {
Statement(s)
} else if(condition){
Statement(s)
} else if(condition){
Statement(s)
} else {
Statement(s)
}
Example::
> marks=63
> if(marks>=35 & marks<50){
+ print("Achieved Third Class")
+ } else if(marks>=50 & marks<60){
+ print("Achieved Second Class")
+ } else if(marks>=60 & marks<=100){
+ print("Achieved First Class")
+ } else {
+ print("Sorry, Failed or verify your marks")
+}
[1] "Achieved First Class"
iv. switch statement – to switch to a particular outcome
Syntax: switch(expression, case1, case2, case3…)
> Marks<-'P'
> switch(Marks, 'P'=print("Passed in the exam"),
+        'F'=print("Failed in the exam"),
+        'D'=print("Detained in the exam"))
[1] "Passed in the exam"
2. Looping statements: There are three looping statements.
i. for loop
ii. while loop
iii. repeat

i. for loop
Syntax: for(var in sequence){
Statement(s)
}
Example::
> # print 1 to 5 numbers
> for(i in 1:5){
+ print(paste("The number is",i))
+}
[1] "The number is 1"
[1] "The number is 2"
[1] "The number is 3"
[1] "The number is 4"
[1] "The number is 5"

Another Example with user input::


# Sum of n numbers
n=as.numeric(readline("Enter n value:")) # asks the user for input; 5 entered here
temp=0
for(i in 1:n){
temp=temp+i
if(i==n){
print(paste("Sum of n numbers is",temp))
}
}
[1] "Sum of n numbers is 15"
ii. while loop
Syntax: while(condtion){
Statement(s)
}
Example: Print even numbers
i=1
n=as.numeric(readline("Enter n value:"))
Enter n value:5
while(i<=n){
if(i%%2==0){
cat(i,"is an even number\n")
}
i=i+1
}
2 is an even number
4 is an even number
iii. repeat
Syntax: repeat{
Statement(s)
if(condition) break
}
Example::
# Repeat the name 5 times
i<-1
repeat{
print("R World")
if(i>=5)break
i=i+1
}
[1] "R World"
[1] "R World"
[1] "R World"
[1] "R World"
[1] "R World"

Note: R Studio is the Integrated Development Environment (IDE) of R.

Differences between R and R Studio are as follows;

R language                               | R Studio
-----------------------------------------|------------------------------------------
Initially one window, later              | Four windows (panes):
extended to one more:                    | 1. R Script
1. R Console                             | 2. R Console
2. R Script                              | 3. R Environment
                                         | 4. R Files & Folders
Syntax must be entered manually          | Syntax suggestions appear automatically
A bit tedious to download packages       | Easy to download packages
Not suitable for saving formatted        | Easy to save formatted workouts
workouts                                 | using R Notebook or R Markdown
Not suitable for developing packages     | Suitable for developing packages

Unit 1.4 Choose Your Test

In this unit, the challenge is to identify a test for a particular variable or a group
of data based on the normality check.
Inferential Statistics – choosing a test by level of measurement and normality::

1. Nominal data (categorical, not normal) – Non-Parametric tests
- Descriptive statistic: Mode
- One sample: Binomial test / Chi-square test (Goodness-of-fit)
- Two sample groups: Chi-square test (Association)
- More than 2 sample groups: Chi-square test
- Paired response: McNemar's test
- Repeated sample: Cochran's Q test
- Relationship b/t 2 variables: Phi Coefficient of Correlation

2. Ordinal data (ordered categorical, not normal) and Scale data that is
not normal – Non-Parametric tests
- Descriptive statistics: all 4 measures (for scale data)
- One sample: Wilcoxon Signed Rank test
- Two sample groups: Mann-Whitney U test
- More than 2 sample groups: Kruskal-Wallis test
- Paired response: Wilcoxon Signed Rank test
- Repeated sample: Friedman's test
- Relationship b/t 2 variables: Spearman's Rank Correlation / Kendall's Tau

3. Scale data (continuous, normal) – Parametric tests
- Descriptive statistics: all 4 measures
- One sample: One Sample t-test
- Two sample groups: Two Sample t-test
- More than 2 sample groups: One Way ANOVA
- Paired response: Paired Sample t-test
- Repeated sample: Repeated Measures ANOVA
- Relationship b/t 2 variables: Karl Pearson's Coefficient of Correlation

Summary

 Analyzing data with a set of statistical tools is referred to as Business Analytics.
 There are five types of analytics, namely Descriptive Analytics, Diagnostic Analytics,
Predictive Analytics, Prescriptive Analytics and Cognitive Analytics.
 Basic Analytics includes the first two types, and the rest come under
Advanced Business Analytics.
 Descriptive Analytics helps in describing the behavior of data, and Diagnostic
Analytics helps in knowing the reasons for a situation.
 This module also explains the data types (numeric, logical, character, complex, integer
and raw), data operators (assignment, arithmetic, relational, logical), data structures (scalar,
vector, matrix, array, list, dataframe, factor), functions, conditional statements (if, if-else,
nested if-else, switch) and looping statements (for, while and repeat) in R.
 Finally, a table has been provided to understand how to choose a test based on the
type of variable and normality.

Self Assessment Questions

1. Which of the following is not a type of Business Analytics?


a. People Analytics b. Diagnostic Analytics
c. Predictive Analytics d. Prescriptive Analytics
2. Which of the following is not a type of Advanced Business Analytics?
a. Descriptive Analytics b. Prescriptive Analytics
c. Cognitive Analytics d. Predictive Analytics
3. Which of the following is not a data type in R?
a. Float b. Integer c. Numeric d. Logical
4. Which of the following is not a data structure in R?
a. Vector b. Scalar c. List d. Complex
5. Which of the following is not a looping statement in R?
a. next b. for c. repeat d. while
6. Which of the following is not a conditional statement in R?
a. break b. if c. if-else d. nested if-else
7. Which of the following is not an operator in R?
a. Assignment Operators
b. Arithmetic Operators
c. Algebraic Operators
d. Relational Operators
8. Which of the following are all keywords in R?
a. next b. break c. a &b d. none
9. Which of the following is not a function in R?
a. for() b. while() c. cube() d. matrix()
10. Which of the following is a test for measuring the relationship between 2 continuous
variables when the data is not normal?
a. Pearson correlation b. Spearman’s Rank Correlation
c. a & b d. none

Terminal Questions
1. Explain the significance of Business Analytics.
2. Elaborate the types of Business Analytics.
3. Illustrate examples for data types in R.
4. Illustrate examples for data operators in R.
5. Illustrate examples for data structures in R.
6. Distinguish user-defined functions from in-built functions in R.
7. Provide a chart for identifying the test based on variables and normality.

Answer Keys (for Self-Assessment Questions)


1. a 6. a
2. a 7. c
3. a 8. c
4. d 9. c
5. a 10. b

Activity

Activity type: Practice in System Duration: 60 min

1. Explore the real-time applications of Business Analytics.


2. Create a data set and apply all the related functionalities on it.

Glossary
Analytics: Analyzing data using a set of statistical tools.
Business Analytics: Analyzing business data using a set of statistical tools.
Descriptive Analytics: Analytics that helps in describing the behavior of data.
Diagnostic Analytics: Analytics that helps in exploring the reasons behind the behavior of data.
Data type: Expresses the nature of an R object.
Data Structure: Describes the organization of data in an R object.
Function: A named set of statements.

Bibliography
Xpert, A. (2023, March 10). Admissions Xpert. Retrieved March 10, 2023, from
https://admissionxpert.in/careers-in-business-analytics/

e-References
https://www.simplilearn.com/types-of-business-analytics-tools-examples-jobs-article
https://www.statmethods.net/r-tutorial/index.html
https://www.datacamp.com/tutorial/data-types-in-r
https://www.w3schools.com/r/default.asp

Video links

Topic: Business Analytics & R
https://www.youtube.com/watch?v=eDrhZb2onWY&list=PL9ooVrP1hQOEIUTpxRf4infBJnquwaTME
Topic: Basics of R
https://www.youtube.com/watch?v=5hqm1ItZjyo

Image Credits Fig1: Google


Keywords
 Analysis
 Analytics
 Business Analytics
 Descriptive Analytics
 Diagnostic Analytics
 Predictive Analytics
 Prescriptive Analytics
 Cognitive Analytics
 Data types
 Data structures
 next
 break
MODULE – II

Descriptive and Diagnostic Analytics

Module Description
In this module, the learning starts with understanding the application of descriptive
analytics to sample data. Further, the study proceeds with the application of inferential
statistics as a part of Diagnostic Analytics by applying hypothesis testing. The module
begins by testing nominal data using the respective tests, namely the binomial test for one
sample group and the chi-square test for two and more than two sample groups. Further,
McNemar's test is used for paired response and Cochran's Q test for repeated response.
Finally, the Phi Coefficient test is used for testing the relationship between two
nominal variables.
Aim
To understand the application of descriptive analytics and a part of diagnostic analytics on
sample data.
Instructional Objectives
This module includes:
 Application of Descriptive Analytics
 Application of Nominal tests of Diagnostic Analytics
Learning Outcomes
 Able to apply descriptive analytics to sample data
 Able to apply the binomial test to one sample group
 Able to apply the chi-square test to two and more than two sample groups
 Able to apply McNemar's test to paired sample response
 Able to apply Cochran's Q test to repeated response
Unit 2.1 Descriptive Analytics

In descriptive analytics, there are four measures (Cooksey, 2020) that help in analyzing the
behavior of data. They are,
i. Measures of Central Tendency (MOCT)
ii. Measures of Dispersion (MOD)
iii. Measures of Skewness (MOSk)
iv. Measures of Kurtosis (MOKu)
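Before taking each measure in turn, all four can be computed together in base R. The following is a minimal sketch on made-up data (the vector x is illustrative, not from any dataset); skewness and kurtosis are computed directly from the moment formulas rather than from a package.

```r
# Illustrative data (an assumption for this sketch, not from mtcars)
x <- c(12, 15, 15, 18, 20, 22, 35)

m  <- mean(x)            # central tendency
md <- median(x)
s  <- sd(x)              # dispersion
rg <- diff(range(x))     # range as a single number (max - min)

# Moment-based skewness and kurtosis, computed by hand
z    <- (x - m) / sqrt(mean((x - m)^2))  # standardised values
skew <- mean(z^3)        # > 0 here, i.e. a positively skewed sample
kurt <- mean(z^4)        # compared against 3 (mesokurtic)

round(c(mean = m, median = md, sd = s, range = rg,
        skewness = skew, kurtosis = kurt), 3)
```

Here the mean exceeds the median, which is consistent with the positive skewness value.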

2.1.1 Measures of Central Tendency


These measures help in understanding the distribution of data with the help of
three measures: mean, median and mode. These are also called positional
averages or measures of location (Mehemetoglu & Mittner, 2021).

Example:
Use the mtcars dataset and find out the MOCT. To analyse the data, the wt
variable of the mtcars dataset is used below.
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
> # Extract wt from the mtcars dataset
> # Install the DescTools package to compute Mean(), Median() and Mode()
> attach(mtcars) # attach variables of the mtcars dataframe to the R environment
> library(DescTools)
Warning message:
package ‘DescTools’ was built under R version 4.2.2
> Mean(wt) # Average weight of the 32 cars is 3.21725 (1000 lbs; 1 lb = 0.45 kg)
[1] 3.21725
> Median(wt)
[1] 3.325
> # 50% of the cars weigh 3.325 (1000 lbs) or less.
> Mode(wt) # The most common weight is 3.44 (1000 lbs), shared by 3 cars
[1] 3.44
attr(,"freq")
[1] 3
From the above results, the mean (3.217) is less than the median (3.325), which suggests a
negatively skewed distribution. However, as the two values are close to each other, this
alone is not enough to confirm negative skewness.

2.1.2 Measures of Dispersion


These measures help in understanding the dispersion, spread or variability of
data. They are of two types: absolute and relative.
Absolute Measures - Relative Measures
--------------------------------------------------------
Range - Coefficient of Range
Mean Deviation - Coefficient of Mean Deviation
Quartile Deviation - Coefficient of Quartile Deviation
Standard Deviation - Coefficient of Standard Deviation
Variance - Coefficient of Variance

For example, relative measures compare the heights of children measured in
centimetres and in feet better than absolute measures can, because they are
unit-free.

In this session, a few measures, namely range, quartiles, standard deviation and
variance, are used in R.
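The unit-free comparison mentioned above can be sketched with the coefficient of variation (standard deviation divided by the mean). The two height vectors below are made-up illustrative data, not taken from the text.

```r
height_cm <- c(150, 155, 160, 165, 170)  # one group, measured in centimetres
height_ft <- c(4.6, 4.9, 5.2, 5.5, 5.8)  # another group, measured in feet

# Coefficient of variation, expressed as a percentage
cv <- function(v) sd(v) / mean(v) * 100

# The absolute sds (cm vs feet) cannot be compared directly,
# but these unit-free percentages can:
cv(height_cm)
cv(height_ft)
```

Whichever group shows the larger coefficient of variation is the more variable one, regardless of the unit used.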

Example:
> # MOD
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
> range(mpg)
[1] 10.4 33.9
> rg<-max(mpg)-min(mpg)
> rg
[1] 23.5
> quantile(mpg,0.20)
20%
15.2
> quantile(mpg,0.50)
50%
19.2
> quantile(mpg,0.80)
80%
24.08
> # Standard Deviation of Project A and B
> # Let us assume the xyz company has to take up one project among A and B,
> # each with 4-year returns
> pa<-c(10,5,35,10)
> sum(pa)
[1] 60
> pb<-c(15,20,10,15)
> # Initially, to know which project to accept, the mean of the two projects is computed as
follows;
> mean(pa)
[1] 15
> mean(pb)
[1] 15
> # To better identify which project to select use standard deviation(sd)
> sd(pa)
[1] 13.54006
> sd(pb)
[1] 4.082483
> # Of the two projects pa and pb, pb is chosen, as its returns are more consistent
(lower standard deviation).

2.1.3 Measures of Skewness

This measure helps in understanding whether the distribution of data is
positively or negatively skewed. Theoretically, if skewness is zero, the data is
normally distributed. In practice, skewness is rarely exactly zero, so
statisticians consider data with skewness between -1 and +1 (or, more strictly,
between -0.5 and +0.5) to be approximately normal.

Example in R
> # Skewness
> library(moments)
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
> # check the skewness of mpg
> # Attach the variables of mtcars to the environment using attach()
> attach(mtcars)
> skewness(mpg)
[1] 0.6404399
> # Above result is positive, means positively skewed
> # To test its statistical significance, use D'Agostino test
> # NH: Data have no Skewness
> # AH: Data have a Skewness
> agostino.test(mpg)

D'Agostino skewness test

data: mpg
skew = 0.64044, z = 1.63510, p-value = 0.102
alternative hypothesis: data have a skewness

> # From the above result, it is evident that the p-value is 0.102 > 0.05, so we fail to reject the
NH.
> # It is concluded that the data have no skewness, meaning the mpg variable
is normal.

2.1.4 Measures of Kurtosis

This is the last measure that helps in understanding the distribution of data.
Kurtosis describes the peakedness of the curve. There are three types of
kurtosis. They are,
1. Platykurtic – where k < 3 (kurtosis) or < 0 (excess kurtosis)
2. Mesokurtic – where k = 3 (kurtosis) or = 0 (excess kurtosis)
3. Leptokurtic – where k > 3 (kurtosis) or > 0 (excess kurtosis)
Usually, the requirement is that the data should be mesokurtic to be treated as
normally distributed. In practice, the value of kurtosis is rarely exactly 3, so
statisticians allow a relaxation: if the value of k is close to 3, the data can be
considered normal.

Example:
> # Kurtosis
> library(moments)
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
> attach(mtcars)
> # let us compute kurtosis for mpg
> kurtosis(mpg)
[1] 2.799467
> # From the above value, it is evident that the kurtosis is 2.79, almost close to 3.
> # This suggests that the data is mesokurtic, i.e., the data seems to be
normal.
> # To check whether the result is statistically significant, apply the Anscombe-Glynn
test.
> anscombe.test(mpg)

Anscombe-Glynn kurtosis test

data: mpg
kurt = 2.79947, z = 0.20148, p-value = 0.8403
alternative hypothesis: kurtosis is not equal to 3

> # NH: Kurtosis is equal to 3


> # AH: Kurtosis is not equal to 3
> # From the above result, it is evident that the p-value is 0.8403 > 0.05, so we fail to reject
the NH.
> # This means the kurtosis does not differ significantly from 3, i.e., the data is normal.

Unit 2.2 Diagnostic Analytics

In this unit, the concepts cover hypothesis testing, since the diagnosis draws conclusions
about the population based on a sample. As a part of Diagnostic Analytics, inferential
statistics is covered along with descriptive analytics. In hypothesis testing, different data
comes with different levels of measurement, namely nominal, ordinal and scale (interval or
ratio), and statistical tests are developed to handle each of them. As discussed earlier,
there are four levels of measurement (Morris & Sheedy, 2022). They are as follows;
1. Nominal
2. Ordinal
3. Interval
4. Ratio
Usually, nominal data is used for identification, categorization or classification. Some
examples are gender, marital status, colors, names of places etc. In this data, the
categories are not comparable. The moment the categories become comparable, the data
becomes ordinal data, as in educational qualification, where PG students are more qualified
than UG students. Therefore, ordinal data is also referred to as ordered categorical or
ranked data. Further, if the distance between these categories is the same, the data is
termed interval data, but without an absolute zero, as in temperature scales other than
Kelvin. Finally, data with an absolute zero comes under ratio data, like income, sales,
profits, marks etc.
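The nominal-versus-ordinal distinction maps directly onto R's own data structures: nominal data is an unordered factor, while ordinal data is an ordered factor whose levels become comparable. The variables below are illustrative.

```r
# Nominal: categories cannot be ranked
gender <- factor(c("male", "female", "female", "male"))

# Ordinal: categories carry an order (UG < PG < PhD)
qual <- factor(c("UG", "PG", "UG", "PhD"),
               levels = c("UG", "PG", "PhD"), ordered = TRUE)

is.ordered(gender)  # FALSE: male/female cannot be compared
is.ordered(qual)    # TRUE
qual[2] > qual[1]   # TRUE: PG ranks above UG
```

Comparison operators like `>` work only on the ordered factor; applying them to an unordered factor produces NA with a warning.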

At present, the focus is on nominal data, whose tests are taken up in the following unit.

Unit 2.3 Nominal Tests

When nominal tests come to mind, the data is categorical. Analyzing categorical data is the
basic purpose of nominal tests. These nominal tests are classified as follows;
1. Nominal test for one sample group
2. Nominal test for two sample groups
3. Nominal test for more than two sample groups
4. Nominal test for paired response
5. Nominal test for repeated response
6. Nominal test for measuring the relationship between two variables

2.3.1 Nominal test for one sample group


In this context, a categorical variable can be analyzed under two scenarios.
Case 1: When the categorical variable has two categories only
Case 2: When the categorical variable has more than two categories

Case 1: When the categorical variable has two categories only

In this case, the binomial test is used to analyze the proportion distribution of a
categorical variable with two categories. Let us work on the vs variable from the
mtcars dataset. Here, vs indicates the engine type, with two values, namely
V-shaped engine (0) and straight engine (1). If the analyst wishes to test the
hypothesis that V-shaped engines are preferred by the car models among the 32
cars of the mtcars dataset, the binomial test is suitable, as vs is a categorical
variable. Let us perform the exercise.

Example 1:
> # Binomial test - used to test whether the proportions of a variable
> # with 2 categories differ significantly from a hypothesised value
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
> dt<-table(mtcars$vs) # Extracting vs from mtcars dataset using $ symbol
> dt

0 1
18 14
> # NH: 50% of the cars having V-shaped Engine.
> # AH: 50% of the cars are not having V-shaped Engine.

> binom.test(dt)

Exact binomial test

data: dt
number of successes = 18, number of trials = 32, p-value = 0.5966
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.3766257 0.7363619
sample estimates:
probability of success
0.5625

> # As the p-value is 0.5966 > 0.05, the result fails to reject the NH, meaning about 50% of
the cars have a V-shaped engine.
> # In the result, 56.25% of the cars are V-shaped, i.e., close to 50%.

# Example 2: To see the pass percentage of 15 MBA students in an interview.


The data is as follows;
> # promotion_status - ps
> ps<-c('pass','pass','fail','fail','pass','pass','pass', 'fail','pass','pass','pass','pass',
'pass','pass','pass')
> length(ps)
[1] 15
> dt<-table(ps)
> dt
ps
fail pass
3 12
> # As the study checks the pass percentage in the interview, the ps frequency table has to
> # be reversed using rev()
> dtf<-rev(dt)
> dtf
ps
pass fail
12 3
> # Now, the data in dtf is used for binomial test
> # NH: 50% of the students have passed the interview (p=0.5)
> # AH: 50% of the students have not passed the interview (p!=0.5)
> binom.test(dtf)

Exact binomial test

data: dtf
number of successes = 12, number of trials = 15, p-value = 0.03516
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.5191089 0.9566880
sample estimates:
probability of success
0.8

> # As the p-value is 0.03516 < 0.05, the test rejects the NH and accepts the AH.
> # This indicates that the proportion of pass is not equal to 50%. Here, the probability of
success (pass %) is 80%, i.e., not equal to 50%.

Suppose the same example is extended by adding detained students to the list of 15,
making it 20 students with 3 categories, namely 'pass', 'fail' and 'detained'.

The suitable test for this purpose is the multinomial test, with multinom_test() from the
rstatix package.

Case 2: When the categorical variable has more than two categories.

> # Example
> ps<-c('pass','pass','fail','fail','pass','pass','pass','fail','pass','pass','pass','pass',
'pass', 'pass','pass','detained','detained','pass','pass','pass')
> length(ps)
[1] 20
> dt<-table(ps)
> dt
ps
detained fail pass
2 3 15
> dtv<-c(2,3,15)
> # Let us apply multinom_test() after loading the rstatix package
> library(rstatix)
> multinom_test(dtv)
# A tibble: 1 x 2
         p p.signif
*    <dbl> <chr>
1 0.000919 ***
> # NH: Proportion of pass, fail and detained is equal.
> # AH: Proportion of pass, fail and detained is not equal.
> # From the above result, it is evident that the study could reject the NH.
> # This means that the proportion of students passed, failed and detained is not
equal.
> # To know which category is having major proportion, proceed for post-hoc test
i.e. pairwise_binom_test()
> # use same package rstatix
> pairwise_binom_test(dtv)
# A tibble: 3 x 9
  group1 group2     n estimate conf.low conf.high       p   p.adj p.adj.signif
* <chr>  <chr>  <dbl>    <dbl>    <dbl>     <dbl>   <dbl>   <dbl> <chr>
1 grp1   grp2       5    0.4     0.0527     0.853 1       1       ns
2 grp1   grp3      17    0.118   0.0146     0.364 0.00235 0.00705 **
3 grp2   grp3      18    0.167   0.0358     0.414 0.00754 0.0151  *
> # From the above result, it is clear that the proportions of detained (grp1) and fail (grp2)
are not significantly different.
> # Second, the proportions of grp1 and grp3, and of grp2 and grp3, are significantly
different.
> # Finally, from the frequency table it is evident that the pass proportion is significantly
higher than that of the other groups.

2.3.2 Nominal test for two sample groups

In this context, we have two variables with two internal categories each, like gender
(two categories: male and female) and section (secA and secB).

Usually, the chi-square test is used to address this context. It has four major applications.
They are as follows;
I. Can be used for goodness-of-fit
II. Can be used for testing the independence or association between two categorical
variables (also suitable for the context of 2.3.3).
III. Can be used for testing homogeneity
IV. Assessing the population variance from the sample variance (beyond the scope).

Let us have simple examples for the above applications of chi-square test.

I. Goodness-of-fit
This concept states that whether the collected data is having a particular
distribution or not. Here, the distribution of students across the three sections is
tested assuming a uniform distribution of 180 students. The obtained data is
70,30,80. Testing whether this data follows a uniform distribution or
not(i.e.,60,60,60).

> # chisquare test


> strength<-c(70,30,80)
> chisq.test(strength,p=c(1/3,1/3,1/3))

Chi-squared test for given probabilities

data: strength
X-squared = 23.333, df = 2, p-value = 8.575e-06

> # NH: Data follows uniform distribution


> # AH: Data does not follow uniform distribution
> # As p-value is 0.00000857 < 0.05, result rejects the NH, means the data does not follow
Uniform distribution.
>
>
> strength<-c(60,50,70)
> chisq.test(strength,p=c(1/3,1/3,1/3))

Chi-squared test for given probabilities

data: strength
X-squared = 3.3333, df = 2, p-value = 0.1889

> # As p-value is 0.1889>0.05, result fails to reject NH, means the data follows a Uniform
distribution.

II. Testing of Independence or Association (With two categories in each variable)


> m
secA secB
male 50 45
female 20 35
> # NH: No Association between Gender and section
> # AH: An Association between Gender and section
> chisq.test(m)

Pearson's Chi-squared test with Yates' continuity correction

data: m
X-squared = 3.0791, df = 1, p-value = 0.07931

> # As the p-value is 0.07931 > 0.05, fails to reject NH, means No Association between
Gender and Sections.
> info<-c(50,20,20,60)
> m<-matrix(info,2,dimnames=list(c('male','female'),c('secA','secB')))
> m
secA secB
male 50 20
female 20 60
> # NH: No Association between Gender and section
> # AH: An Association between Gender and section
> chisq.test(m)

Pearson's Chi-squared test with Yates' continuity correction

data: m
X-squared = 30.496, df = 1, p-value = 3.346e-08

> # As the p-value is 0.00000003346 < 0.05, rejects NH, means there is an Association
between Gender and Sections.
> # Here, as the expected cell frequencies are more than 5, there is no need for the
continuity correction.
> # You can switch it off by setting correct = F
> chisq.test(m,correct=F)
Pearson's Chi-squared test

data: m
X-squared = 32.334, df = 1, p-value = 1.298e-08

> # You can observe that the p-value decreased to 0.00000001298 from 0.00000003346;
the continuity correction makes the test more conservative, reducing its power.
> # The value of the statistic has changed, but the outcome holds: there is an association
between gender and sections.

III. Testing the Homogeneity (With two categories in each variable)


> # Testing Homogeneity
> m
secA secB
male 50 20
female 20 60
> # NH: The distribution of male and female between secA and secB is the same.
> # AH: The distribution of male and female between secA and secB is not the same.
> chisq.test(m)

Pearson's Chi-squared test with Yates' continuity correction

data: m
X-squared = 30.496, df = 1, p-value = 3.346e-08

> # As the p-value is significant, the distribution of male and female between secA and
secB is not the same.
> # Indeed, more males are in secA and more females are in secB.

IV. Assessing population variance from sample variance (Optional)

2.3.3 Nominal test for more than two sample groups

This scenario covers two applications, namely testing of independence or association
and testing of homogeneity. Here, one of the two variables has more than two internal
categories.
I. Testing of Independence or Association

Extending the previous example, let us take the gender-wise division across the 3
sections. Here, the objective is to verify the association between gender and section.
> info<-c(40,30,20,10,40,40)
> # make a 2*3 matrix with info
> m<-matrix(info,2,dimnames=list(c('male','female'),c('secA','secB','secC')))
> m
secA secB secC
male 40 20 40
female 30 10 40
> # Input for chi-square test
> # NH: There is no association between Gender and Sections of Students.
> # AH: There is an association between Gender and Sections of Students.
> chisq.test(m)

Pearson's Chi-squared test

data: m
X-squared = 2.5714, df = 2, p-value = 0.2765

> # As the p-value is 0.2765 > 0.05, it fails to reject the NH.
> # This means that there is no association between gender and sections of students.
>
> # Let us change the info to c(50,20,15,15,20,60)
> info<-c(50,20,15,15,20,60)
> m<-matrix(info,2,dimnames=list(c('male','female'),c('secA','secB','secC')))
> m
secA secB secC
male 50 15 20
female 20 15 60
> # Let us apply chi-square test
> chisq.test(m)

Pearson's Chi-squared test

data: m
X-squared = 32.402, df = 2, p-value = 9.206e-08

> # As the above p-value is 0.00000009206 < 0.05, the test rejects the NH
> # and accepts the AH. This means that there is an association between gender and
sections.
> # By looking at the table, it is evident that more males are in secA and more females are
in secC.

Note: There is a case where the chi-square test is not applicable. If up to 20% of the cells
have an expected frequency of less than 5, you can still apply the chi-square test, but if
more than 20% of the cells have an expected frequency of less than 5, then proceed to
Fisher's Exact test using fisher.test() in R.
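A minimal sketch of that fallback, using made-up counts small enough that several expected frequencies fall below 5; fisher.test() ships with base R's stats package, so no extra package is needed.

```r
# Illustrative 2x2 table with small counts (assumed data, not from the text)
info <- c(8, 2, 1, 5)
m <- matrix(info, 2, dimnames = list(c('male', 'female'), c('secA', 'secB')))

# NH: no association between gender and section
ft <- fisher.test(m)
ft$p.value   # interpreted against 0.05, exactly as with chisq.test()
```

With these counts the two-sided p-value works out to about 0.035, so the NH would be rejected at the 5% level even though the chi-square approximation is unreliable here.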

II. Testing Homogeneity


This tests whether the distribution of data across the categories or groups is the same. To
better understand this concept, let us take an example;
> m # matrix with gender and section data.
secA secB secC
male 50 15 20
female 20 15 60
> # NH: The distribution of male and female across the three sections is the same.
> # AH: The distribution of male and female across the three sections is not the same.
> chisq.test(m)

Pearson's Chi-squared test

data: m
X-squared = 32.402, df = 2, p-value = 9.206e-08

> # As the p-value is 0.00000009206 < 0.05, we reject the NH and accept the AH.
> # This indicates that the distribution of male and female is not the same, i.e., there is no
homogeneity.

2.3.4 Nominal test for paired response


In order to analyze two nominal responses of the same respondent, McNemar's
test is used.

Example: Consider a scenario where 15 students got an opportunity to attend 2 rounds
of an interview, assuming neither round is an elimination round. The data and its analysis
are as follows;

> R1
 [1] "pass" "pass" "fail" "fail" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass" "pass" "pass" "pass"
> R2
 [1] "fail" "fail" "fail" "pass" "fail" "fail" "fail" "pass" "fail" "fail" "fail" "fail" "pass" "fail" "fail"
> dt<-table(R1,R2)
> dt
R2
R1 fail pass
fail 1 2
pass 11 1
> mcnemar.test(dt)

McNemar's Chi-squared test with continuity correction

data: dt
McNemar's chi-squared = 4.9231, df = 1, p-value = 0.0265

> # As the p-value is 0.0265 < 0.05, it is evident that the result rejects the null
hypothesis.
> # There is an association between R1 and R2. In other words, R1 and R2 are
not independent.
> # The association here is that the students who performed well in R1 could not
perform well in R2.
> # Exploring the reason for this will be helpful for student betterment.

2.3.5 Nominal test for repeated response


In the earlier scenario, the study is on paired response. Here, the study is on
repeated response: when the respondent has more than 2 responses, use Cochran's Q
test.

Example: Continuing the earlier example, let us add one more round, R3. Now, the analysis
demands the researcher to analyse the effect of the type of round on the students'
performance. In other words, the analysis checks whether the performance of students in
all 3 rounds is the same or different.

> R1
 [1] "pass" "pass" "fail" "fail" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass" "pass" "pass" "pass"
> R2
 [1] "fail" "fail" "fail" "pass" "fail" "fail" "fail" "pass" "fail" "fail" "fail" "fail" "pass" "fail" "fail"
> R3
 [1] "pass" "pass" "pass" "pass" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass" "pass" "pass" "fail"
> Outcome<-c(R1,R2,R3)
> Outcome
 [1] "pass" "pass" "fail" "fail" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass" "pass" "pass" "pass"
[16] "fail" "fail" "fail" "pass" "fail" "fail" "fail" "pass" "fail" "fail" "fail" "fail" "pass" "fail" "fail"
[31] "pass" "pass" "pass" "pass" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass" "pass" "pass" "fail"
> treatment<-c(rep(1,15),rep(2,15),rep(3,15))
> treatment
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
> participant<-c(1:15,1:15,1:15)
> participant
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15  1  2  3  4  5  6
[37]  7  8  9 10 11 12 13 14 15
> outcome<-Outcome
> dt<-data.frame(outcome,treatment,participant)
> str(dt)
'data.frame': 45 obs. of 3 variables:
$ outcome    : chr "pass" "pass" "fail" "fail" ...
$ treatment : num 1 1 1 1 1 1 1 1 1 1 ...
$ participant: int 1 2 3 4 5 6 7 8 9 10 …
> # NH: Performance of Students is same in all the 3 rounds of Interview.
> # AH: Performance of Students is not same in all the 3 rounds of Interview.
> library(rstatix)
> cochran_qtest(dt,outcome~treatment|participant)
# A tibble: 1 x 6
  .y.         n statistic    df       p method
* <chr>   <int>     <dbl> <dbl>   <dbl> <chr>
1 outcome    15        13     2 0.00150 Cochran's Q test

> # As per the p-value, the test rejects the NH, meaning the performance of students differs
across the 3 rounds. To know where the difference lies, proceed to the post-hoc test, i.e.,
pairwise_mcnemar_test() in the rstatix package.

> pairwise_mcnemar_test(dt,outcome~treatment|participant)
# A tibble: 3 x 6
  group1 group2       p  p.adj p.adj.signif method
* <chr>  <chr>    <dbl>  <dbl> <chr>        <chr>
1 1      2      0.0265  0.0795 ns           McNemar test
2 1      3      1       1      ns           McNemar test
3 2      3      0.00937 0.0281 *            McNemar test

2.3.6 Nominal test for measuring the relationship between two variables

In this aspect, there comes a scenario where the researcher would like to know the
relationship between two nominal variables; in that case, proceed to the phi coefficient of
correlation (whose magnitude varies from 0 to 1).

Example:

> # Watched a Movie as'wam'


> # Recommended a Movie as 'ram'
> wam<-c('yes','yes','no','yes','no','yes','no','yes','yes','yes','no','yes')
> length(wam)
[1] 12
> ram<-c('yes','no','no','yes','yes','yes','no','no','yes','yes','no','no')
> dt<-table(wam,ram)
> dt
ram
wam no yes
no 3 1
yes 3 5
> phi(dt) # phi() from the psych package (rounded to 2 decimals by default)
[1] 0.35
> Phi(dt) # Phi() from the DescTools package
[1] 0.3535534
From the above result, it is evident that there is a moderate association between the two
variables.
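For reference, the computation behind those wrappers can be sketched in base R, assuming the standard 2x2 formula phi = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d)); on the wam/ram table above, the hand computation reproduces the same 0.3536.

```r
# Phi coefficient for a 2x2 contingency table, computed from the
# standard formula (an assumption; phi()/Phi() wrap the same idea)
phi_coef <- function(tab) {
  a <- tab[1, 1]; b <- tab[1, 2]
  c <- tab[2, 1]; d <- tab[2, 2]
  (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))
}

# Same counts as the wam/ram table in the example above
tab <- matrix(c(3, 3, 1, 5), nrow = 2,
              dimnames = list(wam = c("no", "yes"), ram = c("no", "yes")))
phi_coef(tab)   # matches the Phi(dt) value of 0.3535534
```

Note that this signed version can also come out negative for a 2x2 table; its absolute value is what is interpreted on the 0-to-1 scale.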
Summary

In this module, the learner learns about four measures of descriptive statistics namely
measures of central tendency(or location), measures of dispersion(spread or variability),
measures of skewness and kurtosis(shape).

Discussed the usage of Nominal tests on one categorical variable with two categories I.e.,
binomial test, one categorical variable with morethan 2 categories I.e., multinomial test.

Discussed the usage of chi-square and its applications: goodness-of-fit, testing
association or independence, testing homogeneity, and assessing the population variance
from the sample variance.

Discussed the application of McNemar's test for paired responses and Cochran's Q test for
repeated responses.

Lastly, the phi coefficient of correlation is used to measure the relationship between two
nominal variables.

Self Assessment Questions:

1. What is the other name of Measures of central tendency?


a. Measures of Spread
b. Measures of Shape
c. Measures of Location
d. None.
2. What is the other name of Measures of Dispersion?
a. Measures of Spread
b. Measures of Shape
c. Measures of Location
d. None.
3. What is the other name of Measures of Skewness and Kurtosis?
a. Measures of Spread
b. Measures of Shape
c. Measures of Location
d. None.
4. Which test is recommended for multiple responses of a single nominal variable?
a. McNemar’s test
b. Multinomial test
c. Chi-square test
d. None
5. Which test is used for paired responses of nominal data?
a. Chi-square test
b. McNemar's test
c. Cochran's Q test
d. None.
6. Which test is used for repeated responses of nominal data?
a. Binomial test
b. Chi-Square test
c. Fisher’s test
d. Cochran’s Q test

7. Which test is used for association between two nominal variables?


a. Chi-square test
b. McNemar’s test
c. Phi Coefficient of correlation.
d. a & c

8. Which is the post-hoc test of the multinomial test?


a. Binomial test
b. Chi-square test
c. pairwise_binom_test
d. Tukey test

9. Which package can be used for skewness?


a. psych b. moments c. stats d. None

10. Which is the statistically significant test for kurtosis?


a. Agostino b. Anscombe c. moments d. None

Terminal Questions

1. Explain the measures of descriptive statistics with an example.


2. Elaborate the levels of measurement with examples.
3. Illustrate examples for Binomial and Multinomial data.
4. Illustrate examples for Chi-square test and its applications.
5. Provide an example to work out the application of McNemar's test and of Cochran's Q
test.
6. Illustrate an example for phi coefficient of correlation.

Answer Keys :

 1  2  3  4  5  6  7  8  9  10
 c  a  b  b  b  d  c  c  b  d

Activity

 Create a dataset with three rounds of interview and perform McNemar's as well as
Cochran's Q test.
 Use the mtcars dataset to perform the chi-square test of association.

Glossary

Skewness: Asymmetry of a distribution towards the positive or the negative side.
Kurtosis: Change in the peakedness of the curve.
Binomial test: test for a categorical variable with only two categories.
Multinomial test: test for a categorical variable with more than two categories.
Post-hoc test: test that helps in identifying where the significant difference among the
groups lies.
McNemar's test: test that helps in assessing paired responses.
Cochran's Q test: test that helps in assessing repeated responses.
Phi coefficient of correlation: test that helps in assessing the relationship between two
nominal variables.

Bibliography

Cooksey, R. W. (2020). Illustrating Statistical Procedures:Finding Meaning in Quantitative Data.


Springer.

Mehemetoglu, M., & Mittner, M. (2021). Applied Statistics Using R:A Guide for Social Sciences. Sage
Publications.

Morris, S., & Sheedy, M. (2022). General Mathematics. John Wiley & Sons.

e-References

https://www.simplilearn.com/types-of-business-analytics-tools-examples-jobs-article
https://www.statmethods.net/r-tutorial/index.html
https://www.datacamp.com/tutorial/data-types-in-r
https://www.w3schools.com/r/default.asp

Video links

https://www.youtube.com/watch?v=WbKiJe5OkUU&list=PLFW6lRTa1g83jjpIOte7RuEYCwOJa-6Gz
https://www.youtube.com/watch?v=niJvj7116Kk

Image Credits : NA

Keywords
mean, median, mode, standard deviation, variance, binomial, multinomial, chi-square,
mcnemar, cochranQ, phi coefficient of correlation.
MODULE – III

Ordinal and Scale Tests

Module Description
In this module, the learning starts with an understanding of ordinal and scale data, followed
by their tests. Initially, the data is tested for normality, and based on this result, the
selection of parametric or non-parametric tests is done. Then, the chosen test is used for
analyzing the objective behind the collection of that particular data.
Aim
To understand the application of ordinal and scale test when the data is normal or not
normal.
Instructional Objectives
This module includes:
 Application of Ordinal tests
 Application of Scale tests
Learning Outcomes
 Able to apply ordinal tests when the data is not normal or is ordinal
 Able to apply scale tests when the data is normal
 Able to identify which test to use for which type of data.
Unit 3.1 Ordinal Tests

Generally, the data collected from a survey or any other source will be either categorical or
continuous in nature. If the data is ordered categorical, ordinal tests are applied directly.
Whenever continuous data does not meet the assumption of normality, these tests are applied
to continuous variables too. So, ordinal tests are treated as the equivalent non-parametric
tests (Kloke & McKean, 2015) for parametric tests when the data is not normal.

Let us proceed for Non-parametric tests.

3.1.1 Wilcoxon Signed Rank test (One Sample Group)


Purpose: To find out whether the average profits are more than 10 or not.
Package: stats
Function: wilcox.test()
Example:
># Initially, the normality test is conducted for Profits variable.
> shapiro.test(Profits)

Shapiro-Wilk normality test

data: Profits
W = 0.91617, p-value = 0.04809

> # As p-value is 0.04809 < 0.05, reject NH that the data is normal.
> # Proceed for non-parametric test
> # Equivalent non-parametric test is Wilcoxon signed rank test
> # syntax: wilcox.test(vector,mu,alternative)
> # NH: Company's profits is not satisfactory (i.e., mu(profits) <=10)
> # AH: Company's profits is satisfactory (i.e., mu(profits) >10)
> wilcox.test(Profits,mu=10,alternative='greater')

Wilcoxon signed rank test with continuity correction

data: Profits
V = 116, p-value = 0.7541
alternative hypothesis: true location is greater than 10

Warning messages:
1: In wilcox.test.default(Profits, mu = 10, alternative = "greater") :
cannot compute exact p-value with ties
2: In wilcox.test.default(Profits, mu = 10, alternative = "greater") :
cannot compute exact p-value with zeroes

Note: Need not worry about warning as of now.


># From the above result, it is evident that the p-value is 0.7541, means fails to reject NH.
># This indicates that the Company’s profits is not satisfactory.
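A coarser one-sample alternative is the sign test, which only counts how many non-tied observations exceed the hypothesized value. The sketch below uses made-up profit figures (not the Profits vector above) purely to illustrate the call pattern via binom.test():

```r
# Sign test via binom.test(): under NH the median is 10, so each non-tied
# observation has probability 0.5 of exceeding it.
profits <- c(8, 12, 9, 11, 7, 14, 13, 9, 8, 12, 16, 11)  # made-up data
above <- sum(profits > 10)    # 7 observations exceed 10
n_eff <- sum(profits != 10)   # 12 non-tied observations
binom.test(above, n_eff, p = 0.5, alternative = "greater")
# p-value is about 0.387 > 0.05: fails to reject NH for this made-up sample
```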

3.1.2 Mann-Whitney U test (Two Sample Groups)


Purpose: To analyze the impact of one categorical variable on a continuous variable.
In our context, to analyze the impact of Region on profits of the company.
Package: stats
Function: wilcox.test()
Example:
> # Impact of Region on Profits
> # Testing Normality
> library(rstatix)
> dt %>% group_by(Region) %>% shapiro_test(Profits)
# A tibble: 2 × 4
Region variable statistic p
<fct> <chr> <dbl> <dbl>
1 Guntur Profits 0.932 0.405
2 Vijayawada Profits 0.835 0.0244
> # From the above result it is evident that data of Guntur is normal but not for Vijayawada.
> # Proceed for two sample non-parametric test i.e., Mann-Whitney U test or Wilcoxon Rank
Sum Test.
> # NH:No impact of Region on Profits
> # AH:An impact of Region on Profits
> wilcox.test(Profits~Region,data=dt)

Wilcoxon rank sum test with continuity correction

data: Profits by Region


W = 103, p-value = 0.07553
alternative hypothesis: true location shift is not equal to 0
Warning message:
In wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...) :
cannot compute exact p-value with ties

> # From the above result, p-value is 0.07553 > 0.05, fails to reject NH.
> # # This indicates that NH is retained, meaning Region has no significant effect on Profits.
> # To know Profits Region-wise, use the aggregate function
> aggregate(Profits,by=list(Region),median)
Group.1 x
1 Guntur 12.0
2 Vijayawada 4.5
> aggregate(Profits,by=list(Region),mean)
Group.1 x
1 Guntur 11.25
2 Vijayawada 7.25
By looking at the above mean and median values, there is a noticeable difference in profits
between the two Regions, but it is not statistically significant, so the results are as
obtained above.
Note: As per the warning, exact p-values cannot be computed when there are ties in the
ranks. When the groups are ordered, the Jonckheere-Terpstra test is an option; it is
available in the DescTools and PMCMRplus packages in R, to name a few.

3.1.3 Kruskal-Wallis test (More than 2 Sample groups)


Purpose: The purpose of this test is to analyze the impact of a categorical variable on
continuous or ordinal data.
Package: stats
Function: kruskal.test()
Example:
In this context, the original dataset is not suitable, as all the groups of Brands w.r.t.
Sales or Profits satisfy normality. So, small modifications were made by the author to the
same dataset in order to apply this test. The modified dataset is as follows;
Sales  Region      Brands  Profits  YoE
12     Guntur      BrandA  8        5
14     Guntur      BrandA  10       5
15     Guntur      BrandA  11       6
17     Guntur      BrandA  12       6
18     Guntur      BrandA  12       8
18     Guntur      BrandA  12       8
20     Guntur      BrandA  13       8
13     Guntur      BrandA  9        7
17     Guntur      BrandB  9        5
19     Guntur      BrandB  15       6
16     Guntur      BrandB  12       7
18     Guntur      BrandB  12       6
19     Vijayawada  BrandB  14       6
21     Vijayawada  BrandB  16       10
14     Vijayawada  BrandB  12       6
17     Vijayawada  BrandB  13       6
8      Vijayawada  BrandC  7        5
9      Vijayawada  BrandC  5        5
10     Vijayawada  BrandC  4        6
7      Vijayawada  BrandC  4        7
8      Vijayawada  BrandC  4        7
6      Vijayawada  BrandC  3        4
5      Vijayawada  BrandC  4        4
7      Vijayawada  BrandC  4        3
Table 1: sample dataset
> dt<-read.csv(file.choose(),stringsAsFactors=T)
> str(dt)
'data.frame': 24 obs. of 5 variables:
$ Sales : int 12 14 15 17 18 18 20 13 17 19 ...
$ Region : Factor w/ 2 levels "Guntur","Vijayawada": 1 1 1 1 1 1 1 1 1 1 ...
$ Brands : Factor w/ 3 levels "BrandA","BrandB",..: 1 1 1 1 1 1 1 1 2 2 ...
$ Profits: int 8 10 11 12 12 12 13 9 9 15 ...
$ YoE : int 5 5 6 6 8 8 8 7 5 6 ...

> library(rstatix) # load the library


> dt %>% group_by(Brands) %>% shapiro_test(Profits)
# A tibble: 3 × 4
Brands variable statistic p
<fct> <chr> <dbl> <dbl>
1 BrandA Profits 0.919 0.425
2 BrandB Profits 0.951 0.720
3 BrandC Profits 0.758 0.0101
> # From the above normality testing, the data for BrandA and BrandB is normal, but not for
BrandC.
> # This provides an opportunity to apply the Kruskal-Wallis test, the equivalent
non-parametric test for One-Way ANOVA
> # kruskal.test()
> kruskal.test(Profits~Brands,data=dt)

Kruskal-Wallis rank sum test

data: Profits by Brands


Kruskal-Wallis chi-squared = 17.32, df = 2, p-value = 0.0001733

> # As the p-value is 0.0001733 < 0.05, rejects NH.


> # NH: Average Profits provided by three brands is same.
> # AH: Average Profits provided by three brands is not same.
> # Or
> # NH: No Impact of Brands on Profits
> # AH: An Impact of Brands on Profits
> # Reject NH means accepting the AH, indicates that the Brands are affecting on Profits.
> # To know which brand has contributed more, proceed for post-hoc test named Dunn test.
>library(rstatix)
> dunn_test(dt,Profits~Brands)
# A tibble: 3 × 9
  .y.     group1 group2    n1    n2 statistic         p     p.adj p.adj.signif
* <chr>   <chr>  <chr>  <int> <int>     <dbl>     <dbl>     <dbl> <chr>
1 Profits BrandA BrandB 8 8 1.25 0.210 0.210 ns
2 Profits BrandA BrandC 8 8 -2.81 0.00495 0.00989 **
3 Profits BrandB BrandC 8 8 -4.06 0.0000483 0.000145 ***

From the above result, it is evident that there is no significant difference between the
mean or median scores of BrandA and BrandB. But in the cases of BrandA vs BrandC and
BrandB vs BrandC, there is a statistically significant difference at the 5% level of
significance.
Further, to know which Brand did well, use aggregate to get the numbers for comparison as
follows;
> aggregate(Profits,by=list(Brands),median)
Group.1 x
1 BrandA 11.5
2 BrandB 12.5
3 BrandC 4.0
Finally, it is evident that BrandB is performing better than the other two brands, namely
BrandA and BrandC.
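As a hedged aside (not part of the original analysis), the Kruskal-Wallis statistic converts into an eta-squared-style effect size, which quantifies how strongly Brands affect Profits. A minimal self-contained sketch using the H value from the output above:

```r
# Effect size for Kruskal-Wallis: eta squared based on the H statistic,
# eta2[H] = (H - k + 1) / (n - k), where k = groups and n = sample size.
H <- 17.32   # Kruskal-Wallis chi-squared from the output above
k <- 3       # number of Brand groups
n <- 24      # total observations
eta2_H <- (H - k + 1) / (n - k)
round(eta2_H, 2)   # 0.73, conventionally a large effect
# The rstatix function kruskal_effsize() reports the same quantity.
```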

3.1.4 Wilcoxon Signed Rank test (Paired Sample Group)


Purpose: To analyze the paired responses of respondents.
Package: stats
Function: wilcox.test()
Example:
> # Wilcoxon Signed Rank test
> dt<-read.csv(file.choose())
> dt
BeforeT AfterT
1 10 15
2 10 16
3 14 17
4 16 20
5 13 16
6 12 19
7 10 18
8 10 17
9 10 16
10 12 19
> str(dt)
'data.frame': 10 obs. of 2 variables:
$ BeforeT: int 10 10 14 16 13 12 10 10 10 12
$ AfterT : int 15 16 17 20 16 19 18 17 16 19
> # Testing normality
> detach(dt)
> attach(dt)
> BeforeT
[1] 10 10 14 16 13 12 10 10 10 12
> shapiro.test(BeforeT)

Shapiro-Wilk normality test

data: BeforeT
W = 0.81791, p-value = 0.02391

> shapiro.test(AfterT)

Shapiro-Wilk normality test

data: AfterT
W = 0.93388, p-value = 0.4872

> # NH: Data is normal


> # AH: Data is not normal.
> # From the above results, it is evident that BeforeT data is not normal, whereas AfterT
is normal.
> # As one of the two variables is not normal, proceed for the equivalent non-parametric
test for paired responses.
> # The equivalent test is the Wilcoxon signed rank test.
> # use wilcox.test(), already used above for the one-sample case.
> # NH: Training is not effective (Avg_marks_BeforeT >= Avg_marks_AfterT)
> # AH: Training is effective (Avg_marks_BeforeT < Avg_marks_AfterT)
> wilcox.test(BeforeT,AfterT,alternative='less',paired=T)

Wilcoxon signed rank test with continuity correction

data: BeforeT and AfterT


V = 0, p-value = 0.002865
alternative hypothesis: true location shift is less than 0
Warning message:
In wilcox.test.default(BeforeT, AfterT, alternative = "less", paired = T) :
cannot compute exact p-value with ties
> # From the result, it is evident that NH is rejected as p-value is 0.002865 < 0.05.
> # It is concluded that Avg_marks_AfterT is greater than Avg_marks_BeforeT.
> # So, Training is effective.

3.1.5 Friedman Test (Repeated Responses)


Purpose: To analyze the repeated responses of respondents. In this context, the major
objective is to analyze the performance of students in three different tests.
To perform this exercise, the following dataset is used.
Outcome  Treatment  Participant
10 1 1
10 1 2
14 1 3
16 1 4
13 1 5
12 1 6
10 1 7
10 1 8
10 1 9
12 1 10
15 2 1
16 2 2
17 2 3
20 2 4
16 2 5
19 2 6
18 2 7
17 2 8
16 2 9
19 2 10
19 3 1
20 3 2
18 3 3
19 3 4
18 3 5
19 3 6
19 3 7
20 3 8
18 3 9
20 3 10
Table 2: Dataset for Friedman Test

Package: 'stats' for friedman.test(), 'PMCMRplus' for the post-hoc tests of Friedman.
Function: friedman.test()
Example:
> # Friedman Test
> dt<-read.csv(file.choose()) # Save the dataset in csv and import it.
> str(dt)
'data.frame': 30 obs. of 3 variables:
$ Outcome : int 10 10 14 16 13 12 10 10 10 12 ...
$ Treatment : int 1 1 1 1 1 1 1 1 1 1 ...
$ Participant: int 1 2 3 4 5 6 7 8 9 10 ...
> # Check whether this data is normal or not.
> library(rstatix)
> dt %>% group_by(Treatment) %>% shapiro_test(Outcome)
# A tibble: 3 × 4
Treatment variable statistic p
<int> <chr> <dbl> <dbl>
1 1 Outcome 0.818 0.0239
2 2 Outcome 0.934 0.487
3 3 Outcome 0.832 0.0352
> # As the p-values for Treatments 1 and 3 are below 0.05, their data is not normal;
proceed for a non-parametric test.
> # The context of repeated responses recommends the Friedman test as the alternative to
Repeated Measures ANOVA.
> # Apply the Friedman test
> # NH: Performance of students in three tests is same.
> # AH: Performance of students in three tests is not same.
> friedman.test(Outcome~Treatment|Participant,data=dt)

Friedman rank sum test

data: Outcome and Treatment and Participant


Friedman chi-squared = 17.897, df = 2, p-value = 0.0001299
># As the p-value is 0.0001299 < 0.05, reject NH; conclude that there is a significant
difference in their performance across the three tests. To know where the difference is,
proceed for post-hoc tests.
There are several post-hoc tests available, namely the Nemenyi, Conover, Siegel, Exact and
Miller tests.
># use the PMCMRplus package
> frdAllPairsNemenyiTest(y=Outcome,groups=Treatment,blocks=Participant)

Pairwise comparisons using Nemenyi-Wilcoxon-Wilcox all-pairs test for a two-way


balanced complete block design

data: y, groups and blocks

1 2
2 0.0273 -
3 0.0001 0.2608

P value adjustment method: single-step


> From the above result, it is evident that students performed differently in Test 1 vs
Test 2 (p = 0.0273) and in Test 1 vs Test 3 (p = 0.0001), but equivalently in Test 2 vs
Test 3 (p = 0.2608).
> frdAllPairsConoverTest(y=Outcome,groups=Treatment,blocks=Participant)

Pairwise comparisons using Conover's all-pairs test for a two-way balanced complete
block design

data: y, groups and blocks


1 2
2 0.025 -
3 8.1e-05 0.251

P value adjustment method: single-step

> From the above result too, Test 1 differs significantly from Test 2 (p = 0.025) and from
Test 3 (p = 8.1e-05), while Test 2 and Test 3 do not differ (p = 0.251).

> frdAllPairsSiegelTest(y=Outcome,groups=Treatment,blocks=Participant)

Pairwise comparisons using Siegel-Castellan all-pairs test for a two-way balanced


complete block design

data: y, groups and blocks

1 2
2 0.02025 -
3 0.00011 0.11752

P value adjustment method: holm


> Again, the difference lies between Test 1 and the other two tests (p = 0.02025 and
p = 0.00011); Test 2 and Test 3 are equivalent (p = 0.11752).

> frdAllPairsMillerTest(y=Outcome,groups=Treatment,blocks=Participant)

Pairwise comparisons using Miller, Bortz et al. and Wike all-pairs test for a two-way
balanced complete block design

data: y, groups and blocks

1 2
2 0.03665 -
3 0.00019 0.29376

P value adjustment method: none


> The Miller test agrees: Test 1 differs from Test 2 (p = 0.03665) and from Test 3
(p = 0.00019), whereas Test 2 and Test 3 do not differ (p = 0.29376).

> frdAllPairsExactTest(y=Outcome,groups=Treatment,blocks=Participant)

Pairwise comparisons using Eisinga, Heskes, Pelzer & Te Grotenhuis all-pairs test with
exact p-values for a two-way balanced complete block design
data: y, groups and blocks

1 2
2 0.016 -
3 2.1e-06 0.148

P value adjustment method: holm


> The Exact test agrees as well: Test 1 differs from Test 2 (p = 0.016) and from Test 3
(p = 2.1e-06), while Test 2 and Test 3 are equivalent (p = 0.148).

Overall, from all the above five post-hoc tests, the common finding is that students
performed equivalently in Test 2 and Test 3, but differently in Test 1.
To know in which test they performed better, compute aggregate scores with the median, as
follows;
> aggregate(Outcome,by=list(Treatment),median)
Group.1 x
1 1 11
2 2 17
3 3 19
Finally, it is evident that students performed best in Test 3 and worst in Test 1.
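As a hedged aside, the Friedman chi-squared converts directly into Kendall's W, a common effect size for repeated rankings; a minimal self-contained sketch using the value from the output above:

```r
# Kendall's W from the Friedman chi-squared: W = chi2 / (n * (k - 1)),
# where n = number of participants and k = number of repeated tests.
chi2 <- 17.897   # Friedman chi-squared from the output above
n <- 10          # participants
k <- 3           # tests
W <- chi2 / (n * (k - 1))
round(W, 2)   # 0.89, indicating strong agreement across the three tests
# The rstatix function friedman_effsize() reports the same quantity.
```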

3.1.6 Spearman Rank Correlation (rho)


Purpose: This correlation test is used for ordinal data or when the continuous data is not
normally distributed.
In this context, the analysis is to find whether there is any relationship between Sales and
Profits.
Package: stats
Function: cor(x,y,method=’spearman’),cor.test(x,y,method=’spearman’)
Example: The following data set helps in exploring the variables for analysis.
Sales Profits YoE
12 8 5
14 10 5
15 11 6
17 12 6
18 12 8
18 12 8
20 13 8
13 9 7
17 9 5
19 15 6
16 12 7
18 12 6
19 14 6
21 16 10
14 12 6
17 13 6
8 7 5
9 5 5
10 4 6
7 4 7
8 4 7
6 3 4
5 4 4
7 4 3
Table 3: Data used for Correlation
> shapiro.test(Sales)

Shapiro-Wilk normality test

data: Sales
W = 0.92128, p-value = 0.06236

> shapiro.test(Profits)

Shapiro-Wilk normality test

data: Profits
W = 0.90336, p-value = 0.02538

 From the above results, it is clearly evident that, between Sales and Profits, the Sales
data is normally distributed whereas Profits is not. So, proceed for the non-parametric
correlation test, i.e., Spearman rank correlation, equivalent to Karl Pearson's
coefficient of correlation.
> # NH: No correlation between Sales and Profits
> # AH: A Correlation between Sales and Profits
> cor(Sales,Profits,method='spearman')
[1] 0.9360426
> cor.test(Sales,Profits,method='spearman')

Spearman's rank correlation rho

data: Sales and Profits


S = 147.1, p-value = 1.874e-11
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.9360426

Warning message:
In cor.test.default(Sales, Profits, method = "spearman") :
Cannot compute exact p-value with ties

 The p-value is 1.874e-11 < 0.05, which rejects NH; conclude that there is a
statistically significant, strong positive correlation, as the value (0.9360426) also
indicates.
Note: If there are ties, proceed for Kendall's tau (Sharma, 2018). Kendall's tau has three
versions, namely tau-a, tau-b and tau-c: tau-a is for data without ties; tau-b adjusts for
ties in ordinal and interval data; tau-c is suited to tables where the two variables have
different numbers of categories.
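To make the note concrete, here is a minimal self-contained sketch of Kendall's tau on made-up vectors; with no ties present, cor.test() with method = 'kendall' returns tau-a (which then equals tau-b):

```r
# Kendall's tau counts concordant vs discordant pairs of observations.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 1, 4, 3, 6, 5)   # made-up ranks: three locally swapped pairs
cor.test(x, y, method = "kendall")
# tau = (concordant - discordant) / total pairs = (12 - 3) / 15 = 0.6
```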

Unit 3.2 Scale Tests

These tests are used for continuous variables, in association with categorical variables,
for performing particular analyses. As these tests estimate parameters, they use all of the
data, which makes the results more reliable. When continuous data is not normal,
researchers often try to normalize it using data transformations. If the data collected is
continuous in nature, it has to meet the normality assumption in order to apply parametric
tests (Davis, 2022); otherwise, non-parametric tests are recommended. Most of the time, a
researcher prefers parametric over non-parametric tests wherever applicable, because of
their nature of including all observations in the analysis.

Before proceeding further, the data used is as follows;

Sales  Region      Brands  Profits
12     Guntur      BrandA  8
14     Guntur      BrandA  10
15     Guntur      BrandA  11
17     Guntur      BrandA  12
18     Guntur      BrandA  13
18     Guntur      BrandA  12
20     Guntur      BrandA  13
13     Guntur      BrandA  9
17     Guntur      BrandB  9
19     Guntur      BrandB  15
16     Guntur      BrandB  12
18     Guntur      BrandB  13
19     Vijayawada  BrandB  14
21     Vijayawada  BrandB  16
14     Vijayawada  BrandB  11
17     Vijayawada  BrandB  13
8      Vijayawada  BrandC  7
9      Vijayawada  BrandC  5
10     Vijayawada  BrandC  4
7      Vijayawada  BrandC  4
8      Vijayawada  BrandC  3
6      Vijayawada  BrandC  3
5      Vijayawada  BrandC  2
7      Vijayawada  BrandC  4
Table 4: sample dataset with 2 categorical and 2 continuous variables.

> # Import the data


> dt<-read.csv(file.choose(),stringsAsFactors=T)
> str(dt)
'data.frame': 24 obs. of 5 variables:
$ Sales : int 12 14 15 17 18 18 20 13 17 19 ...
$ Region : Factor w/ 2 levels "Guntur","Vijayawada": 1 1 1 1 1 1 1 1 1 1 ...
$ Brands : Factor w/ 3 levels "BrandA","BrandB",..: 1 1 1 1 1 1 1 1 2 2 ...
$ Profits: int 8 10 11 12 12 12 13 9 9 15 ...
$ YoE : int 5 5 6 6 8 8 8 7 5 6 ...

> dt
Sales Region Brands Profits YoE
1 12 Guntur BrandA 8 5
2 14 Guntur BrandA 10 5
3 15 Guntur BrandA 11 6
4 17 Guntur BrandA 12 6
5 18 Guntur BrandA 12 8
6 18 Guntur BrandA 12 8
7 20 Guntur BrandA 13 8
8 13 Guntur BrandA 9 7
9 17 Guntur BrandB 9 5
10 19 Guntur BrandB 15 6
11 16 Guntur BrandB 12 7
12 18 Guntur BrandB 12 6
13 19 Vijayawada BrandB 14 6
14 21 Vijayawada BrandB 16 10
15 14 Vijayawada BrandB 12 6
16 17 Vijayawada BrandB 13 6
17 8 Vijayawada BrandC 7 5
18 9 Vijayawada BrandC 5 5
19 10 Vijayawada BrandC 4 6
20 7 Vijayawada BrandC 4 7
21 8 Vijayawada BrandC 3 7
22 6 Vijayawada BrandC 3 4
23 5 Vijayawada BrandC 2 4
24 7 Vijayawada BrandC 4 3

> attach(dt) # to attach the variable to R Environment

3.2.1 One Sample T-test (One sample Group)


Purpose: To analyze one continuous variable.
Package: stats
Function: t.test()
Example: To analyze the performance (in sales in lakhs) of AKP company
Assumptions:
i. Normality using shapiro.test() from stats package.
> # NH: Data is normal
> # AH: Data is not normal
> shapiro.test(Sales)

Shapiro-Wilk normality test

data: Sales
W = 0.92128, p-value = 0.06236

 From the above result it is evident that the p-value is 0.06236 > 0.05, fails to
reject NH, so the data is normal.

Now, proceed for One Sample T-test in order to meet the purpose of evaluating the
performance of 24 stores of AKP company.

NH: Company is not performing well (mu <=15 lakh in sales)


AH: Company is performing well (mu > 15 lakhs in sales)

The result of one sample t test is as follows;


> # syntax: t.test(vector,mu,alternative)
> t.test(Sales,mu=15,alternative='greater')

One Sample t-test

data: Sales
t = -1.3083, df = 23, p-value = 0.8982
alternative hypothesis: true mean is greater than 15
95 percent confidence interval:
11.91999 Inf
sample estimates:
mean of x
13.66667
> From the above result, it is evident that the p-value is 0.8982 > 0.05, means fails to
reject NH, means the company AKP is not performing well.

Note: the mu value of 15 is taken hypothetically; it is your choice for the analysis.
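The statistic behind t.test() is easy to reproduce by hand, which helps in reading the output; a self-contained sketch on a small made-up vector (not the Sales data), showing t = (xbar - mu) / (s / sqrt(n)):

```r
# One-sample t statistic computed manually on made-up data.
x <- c(12, 14, 16, 18, 20)   # hypothetical sample
mu <- 15
t_manual <- (mean(x) - mu) / (sd(x) / sqrt(length(x)))
round(t_manual, 4)   # 0.7071
# Cross-check: identical to t.test(x, mu = 15)$statistic
```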

3.2.2 Two sample T-test (Two Sample Groups)


In order to apply this test, assume a scenario of measuring the effect of Region on Sales.
Initially, it was found that overall sales were not satisfactory. Let us see whether there
is any effect of Region on sales.

Purpose: To analyze the impact of one categorical variable on the other continuous
variable.
In this context, to analyze the impact of Region (Categorical) on Sales(Continuous) of the
Company.
Package: stats
Function: t.test()
Example: To analyze the performance (in sales in lakhs) of AKP company
Assumptions:
There are two basic Assumptions.
1. Normality - Shapiro Wilks test
2. Homogeneity – Bartlett’s test or Levene’s Test

Let us workout on Assumptions first.


1. Normality
In order to perform category-wise normality, use rstatix package as follows;
> library(rstatix)
> dt %>% group_by(Region) %>% shapiro_test(Sales)
# A tibble: 2 × 4
Region variable statistic p
<fct> <chr> <dbl> <dbl>
1 Guntur Sales 0.950 0.642
2 Vijayawada Sales 0.864 0.0549
> # As we know the hypothesis for normality is as follows;
> # NH: Data is normal
> # AH: Data is not normal
> # As the p-values 0.642 and 0.0549 are greater than 0.05, fail to reject NH; the data is normal.
> # Now, proceed for two sample parametric test. Compute Homogeneity test too.

2. Homogeneity- Equal variances

> # Hypothesis for Homogeneity


> # NH: Variances are Equal (Homogeneity Exists)
> # AH: Variances are not Equal (Homogeneity does not Exist)
> bartlett.test(Sales~Region)

Bartlett test of homogeneity of variances

data: Sales by Region


Bartlett's K-squared = 5.9802, df = 1, p-value = 0.01447

> # As the p-value is 0.01447 < 0.05, the study rejects the NH, meaning the variances are
not equal.
> # Choose a two sample parametric test for unequal variances - the Welch Two Sample
t-test (the default t.test in R)
> # Hypothesis for welch two sample test
> # NH: No impact of Region on Sales
> # AH: An impact of Region on Sales
> # Other way of expressing the same hypothesis is as follows
> # NH: Average sales at Guntur is equal to the Average sales at Vijayawada
> # AH: Average sales at Guntur is not equal to the Average sales at Vijayawada
> t.test(Sales~Region,data=dt) # default two sample test is welch two sample test

Welch Two Sample t-test

data: Sales by Region


t = 3.1923, df = 15.347, p-value = 0.005917
alternative hypothesis: true difference in means between group Guntur and group
Vijayawada is not equal to 0
95 percent confidence interval:
1.834904 9.165096
sample estimates:
mean in group Guntur mean in group Vijayawada
16.41667 10.91667

> # As p-value is 0.005917 < 0.05, rejects NH.


> # This indicates that AH is accepted stating that there is an impact of Region on Sales.
> # By observing mean scores, it is evident that Guntur Region is doing well with 16.4 lakh
sales better than Vijayawada with 10.91 lakhs.

Note: If homogeneity exists, include argument as var.equal=T, provides Two Sample T-test
results.
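For completeness, the pooled-variance Student's t-test mentioned above can be requested explicitly with var.equal = TRUE. The sketch below uses a tiny made-up data frame (dt2 is hypothetical, not the dataset from the example) so it runs on its own:

```r
# Student's (pooled-variance) two-sample t-test - only appropriate when
# Bartlett's test does NOT reject equal variances.
dt2 <- data.frame(
  Sales  = c(12, 14, 15, 17, 8, 9, 10, 7),
  Region = rep(c("Guntur", "Vijayawada"), each = 4)
)
t.test(Sales ~ Region, data = dt2, var.equal = TRUE)  # t = 4.899 on 6 df
```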
3.2.3 One Way ANOVA (More than 2 Sample Groups)
In order to apply this test, it has to meet the following assumptions.
1. Normality by Shapiro-Wilks test
2. Homogeneity by Bartlett’s or Levene’s test
3. Independence of Observations – Assuming the data is collected independently.
Purpose: To analyze the impact of a categorical variable (2 or more categories) on a
continuous variable.
In our context, to analyze the impact of type of Brand(A,B,C) on the Sales of the company.
Package: stats
Function: aov()
Example:
> # One-Way ANOVA
> help(aov)
> head(dt)
Sales Region Brands Profits YoE
1 12 Guntur BrandA 8 5
2 14 Guntur BrandA 10 5
3 15 Guntur BrandA 11 6
4 17 Guntur BrandA 12 6
5 18 Guntur BrandA 12 8
6 18 Guntur BrandA 12 8
> # Impact of Brands on Sales of the company.
> # 1. Normality
> library(rstatix)
># NH: data is normal
># AH: data is not normal
> dt %>% group_by(Brands) %>% shapiro_test(Sales)
# A tibble: 3 × 4
Brands variable statistic p
<fct> <chr> <dbl> <dbl>
1 BrandA Sales 0.954 0.751
2 BrandB Sales 0.981 0.966
3 BrandC Sales 0.983 0.975
># In the above result, as all p-values of Brands > 0.05, groups data is normal.
> # Proceed for a parametric test i.e., ANOVA.
> # Being only one Independent variable, it is referred as One-Way ANOVA.
> # let us test Homogeneity too.
> # 2. Homogeneity
> # NH: Variances are equal (i.e., Homogeneity exists)
> # AH: Variances are not equal (i.e., Homogeneity does not exist)
> bartlett.test(Sales~Brands)

Bartlett test of homogeneity of variances

data: Sales by Brands


Bartlett's K-squared = 1.9854, df = 2, p-value = 0.3706
> # The above result concludes that homogeneity exists, as the p-value is 0.3706 > 0.05 (fails to reject NH).
> # Now, proceed for One-way ANalysis Of VAriance (ANOVA)
> # use aov() from stats(default package)
> # Hypothesis of one-way ANOVA
> # NH: Average Sales of three Brands is approximately same.
> # AH: Average Sales of three Brands is not same.
> model<-aov(Sales~Brands,data=dt)
> summary(model)
Df Sum Sq Mean Sq F value Pr(>F)
Brands 2 468.6 234.29 46.97 1.77e-08 ***
Residuals 21 104.7 4.99
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> # The above p-value is 0.0000000177 < 0.05, reject NH
> # The above p-value of ANOVA result indicates that there is a significant impact of
> # Brands on the sales of the company. To know which Brand gave better sales, proceed
> # for post-hoc test. Choose post-hoc test where variances are equal (as proved) i.e.,
> # TukeyHSD test (default in R).
> # Post-hoc test when variances are equal.
> TukeyHSD(model)
Tukey multiple comparisons of means
95% family-wise confidence level

Fit: aov(formula = Sales ~ Brands, data = dt)

$Brands
diff lwr upr p adj
BrandB-BrandA 1.750 -1.064726 4.564726 0.2814838
BrandC-BrandA -8.375 -11.189726 -5.560274 0.0000007
BrandC-BrandB -10.125 -12.939726 -7.310274 0.0000000

> # From the above result based on the diff and p adj, we have to conclude which brand is
better.
> # If you observe BrandB-BrandA,their diff value is positive means BrandB's mean >
BrandA's mean
> # If you observe BrandC-BrandA,their diff value is negative means BrandC's mean <
BrandA's mean
> # If you observe BrandC-BrandB,their diff value is negative means BrandC's mean <
BrandB's mean
> # In short, BrandB's mean >BrandA's mean >BrandC's mean; Brand B is better performer.
> # To confirm, compute aggregate
> aggregate(Sales,by=list(Brands),mean)
Group.1 x
1 BrandA 15.875
2 BrandB 17.625
3 BrandC 7.500

Finally, from the above result, it is evident that BrandB is performing better than BrandA and
BrandC respectively.
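Had Bartlett's test rejected homogeneity instead, an alternative path exists: Welch's one-way ANOVA via oneway.test() in the stats package (with a Games-Howell post-hoc, e.g. rstatix::games_howell_test()). A minimal self-contained sketch on a made-up frame (dt2 is hypothetical):

```r
# Welch's one-way ANOVA does not assume equal variances across groups.
dt2 <- data.frame(
  Sales  = c(12, 14, 15, 17, 17, 19, 16, 18, 8, 9, 10, 7),
  Brands = rep(c("BrandA", "BrandB", "BrandC"), each = 4)
)
oneway.test(Sales ~ Brands, data = dt2)   # var.equal = FALSE is the default
```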
3.2.4 Paired Sample T-test (Paired Responses)
In order to apply this test, paired responses of respondents are needed.
Purpose: To analyze the paired responses of respondents, e.g., before and after training,
or Test1 and Test2 marks, to name a few.
Package: stats
Function: t.test(…,paired=T….)
Example:
> # Paired sample t-test
> # This test for those who give paired response.
> dt<-read.csv(file.choose())
> dt
BeforeT AfterT
1 10 15
2 12 16
3 14 17
4 16 20
5 13 16
6 12 19
7 11 18
8 10 17
9 10 16
10 12 19
> str(dt)
'data.frame': 10 obs. of 2 variables:
$ BeforeT: int 10 12 14 16 13 12 11 10 10 12
$ AfterT : int 15 16 17 20 16 19 18 17 16 19
> # Assumption
> # 1. Testing Normality
> # if normality exists proceed for Paired sample t-test
> # use t.test() from stats package
> attach(dt) # attaching variables to R Environment
> shapiro.test(BeforeT)

Shapiro-Wilk normality test

data: BeforeT
W = 0.89749, p-value = 0.2056

> shapiro.test(AfterT)

Shapiro-Wilk normality test

data: AfterT
W = 0.93388, p-value = 0.4872

> # NH: Data is Normal


> # AH: Data is not Normal
> # From the above results, it is evident that both variables BeforeT and AfterT are normal.
> # As it is 2 responses of the same 10 respondents, proceed for paired sample t-test.
> # use t.test()
> # NH: Avg(BeforeT) >= Avg(AfterT)
> # AH: Avg(BeforeT) < Avg(AfterT)
> t.test(BeforeT,AfterT,alternative='less',paired=T)

Paired t-test

data: BeforeT and AfterT


t = -9.8419, df = 9, p-value = 2.043e-06
alternative hypothesis: true mean difference is less than 0
95 percent confidence interval:
-Inf -4.312838
sample estimates:
mean difference
-5.3

> # From the above result, it is evident that as the p-value is 0.000002043 < 0.05, reject NH.
> # Therefore, the average marks of students before training are less than the average marks
> # after training. It can be concluded that the training is effective.
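The paired t-test reports significance but not the magnitude of the training effect. A base-R sketch of Cohen's d for paired data (mean of the differences divided by their standard deviation), using the same BeforeT and AfterT values as above:

```r
# Sketch: effect size (Cohen's d) for the paired data above, base R only.
BeforeT <- c(10, 12, 14, 16, 13, 12, 11, 10, 10, 12)
AfterT  <- c(15, 16, 17, 20, 16, 19, 18, 17, 16, 19)
d_diff  <- AfterT - BeforeT            # paired differences
d       <- mean(d_diff) / sd(d_diff)   # Cohen's d for paired samples
round(d, 2)                            # 3.11, a very large effect
```

As a quick consistency check, d multiplied by the square root of n (3.11 × √10 ≈ 9.84) reproduces the |t| = 9.8419 reported above.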

3.2.5 Repeated Measures ANOVA (Repeated responses)


Purpose: To analyse the repeated responses of the same respondent. In this context, the
emphasis is on the impact of type of test on the performance of students(i.e., marks).
Package: rstatix
Function: anova_test()
Example: Initially, the data is arranged or obtained in the following format. Here, the marks
of students in three tests are taken. Now, the challenge is to statistically analyse in which
test the students have performed well using Repeated Measures ANOVA.
Test1 Test2 Test3
10 15 19
10 16 20
14 17 18
16 20 19
13 16 18
12 19 19
10 18 19
10 17 20
10 16 18
12 19 20
Table 3: Basic Data of 3 tests

Outcome Treatment Participant
10 1 1
10 1 2
14 1 3
16 1 4
13 1 5
12 1 6
10 1 7
10 1 8
10 1 9
12 1 10
15 2 1
16 2 2
17 2 3
20 2 4
16 2 5
19 2 6
18 2 7
17 2 8
16 2 9
19 2 10
19 3 1
20 3 2
18 3 3
19 3 4
18 3 5
19 3 6
19 3 7
20 3 8
18 3 9
20 3 10
Table 4: Modified format for 3 tests
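The long layout of Table 4 can be produced programmatically rather than retyped by hand; a base-R sketch (no packages needed), with the Table 3 scores entered as a wide data frame:

```r
# Sketch: reshaping the wide Table 3 layout into the long Table 4 layout.
wide <- data.frame(
  Test1 = c(10, 10, 14, 16, 13, 12, 10, 10, 10, 12),
  Test2 = c(15, 16, 17, 20, 16, 19, 18, 17, 16, 19),
  Test3 = c(19, 20, 18, 19, 18, 19, 19, 20, 18, 20)
)
long <- data.frame(
  Outcome     = c(wide$Test1, wide$Test2, wide$Test3),
  Treatment   = rep(1:3, each = nrow(wide)),        # 1 = Test1, 2 = Test2, 3 = Test3
  Participant = rep(seq_len(nrow(wide)), times = 3)
)
str(long)   # 30 obs. of 3 variables, the shape anova_test() expects
```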
> # Repeated Measures ANOVA
> dt<-read.csv(file.choose())
> str(dt)
'data.frame': 30 obs. of 3 variables:
$ Outcome : int 10 17 14 16 13 12 15 11 10 12 ...
$ Treatment : int 1 1 1 1 1 1 1 1 1 1 ...
$ Participant: int 1 2 3 4 5 6 7 8 9 10 ...
> # Repeated Measures ANOVA has two major assumptions.
> # 1. Normality
> # 2. Sphericity
> # - Variances of differences between the groups is same.
> # 1. Normality
> # NH: Data is normal
> # AH: Data is not normal
> dt %>% group_by(Treatment) %>% shapiro_test(Outcome)
# A tibble: 3 × 4
Treatment variable statistic p
<int> <chr> <dbl> <dbl>
1 1 Outcome 0.942 0.578
2 2 Outcome 0.934 0.487
3 3 Outcome 0.903 0.238
> # The above result indicates that the data is normal as the treatments p-value > 0.05.
> # As the data is normal proceed for parametric test i.e.,Repeated Measures ANOVA.
> # NH: Performance of Student in 3 Tests is same
> # AH: Performance of student in 3 Tests is not same.
> dt %>% anova_test(dv='Outcome',wid='Participant',within='Treatment')
ANOVA Table (type III tests)

$ANOVA
Effect DFn DFd F p p<.05 ges
1 Treatment 2 18 28.547 2.61e-06 * 0.644

$`Mauchly's Test for Sphericity`


Effect W p p<.05
1 Treatment 0.836 0.489

$`Sphericity Corrections`
Effect GGe DF[GG] p[GG] p[GG]<.05 HFe DF[HF] p[HF] p[HF]<.05
1 Treatment 0.859 1.72, 15.47 1.11e-05 * 1.043 2.09, 18.77 2.61e-06 *

 The above results are divided into three parts. They are,


1. ANOVA result
2. Sphericity result
3. Correction result
1. ANOVA result: As the p-value is 0.00000261 < 0.05, reject NH and
accept the alternative: the performance of the students in the 3
Tests is not the same. There is scope for a post-hoc test to know
in which Test the students performed better. Lastly, it provides a
ges (generalized eta squared) value of 0.644, a large effect size,
meaning the relationship between the marks of the students and
the Type of Test is strong.
2. Sphericity result: As the W value is 0.836 with an insignificant
p-value of 0.489 > 0.05, we fail to reject the NH of the sphericity
check. Here, the null hypothesis for sphericity is “Sphericity is
not violated” and the alternative is “Sphericity is violated”. If the
p-value is > 0.05, accept the NH: sphericity is not violated.
If sphericity had been violated, the Greenhouse-Geisser (GGe) or
Huynh-Feldt (HFe) correction would be reported instead. If the
epsilon value is below 0.75, report the GGe results; if the epsilon
value is greater than 0.75, report the HFe values. Remember that
as epsilon moves towards 0, sphericity is more severely violated,
and as it moves towards 1, sphericity is not violated at all.
3. Corrections result: Note that epsilon is not Mauchly’s W but the
GGe/HFe value in the corrections table. As the GGe epsilon is
0.859 > 0.75, the Huynh-Feldt (HFe) correction is the one to
report if needed; its epsilon of 1.043 indicates at most a mild
violation of sphericity, and the corrected p-value of 2.61e-06 <
0.05 remains significant. This confirms that there is a significant
effect of Treatment on the marks obtained by the students.
Overall, the results conclude that there is an impact of the Type of Test on the marks obtained
by the students. Now, to know in which Test the students performed better, proceed with the
post-hoc tests.
> pairwise_t_test(dt,Outcome~Treatment) # from rstatix package only
# A tibble: 3 × 9
. y. group1 group2 n1 n2 p p.signif p.adj p.adj.signif
* <chr> <chr> <chr> <int> <int> <dbl> <chr> <dbl> <chr>
1 Outcome 1 2 10 10 0.0000182 **** 0.0000364 ****
2 Outcome 1 3 10 10 0.000000398 **** 0.0000012 ****
3 Outcome 2 3 10 10 0.159 ns 0.159 ns

 The above results indicate that there is a significant difference in the performance of
students from Test 1 to Test 2 and from Test 1 to Test 3, but not from Test 2 to Test 3.
> aggregate(dt$Outcome,by=list(dt$Treatment),mean)
Group.1 x
1 1 13.0
2 2 17.3
3 3 18.5
Finally, from the above mean scores, it is evident that students performed equivalently in Test 2
and Test 3, both better than Test 1. Strictly speaking, students performed best in Test 3 and
worst in Test 1.

3.2.6 Karl-Pearson’s Coefficient of Correlation


Purpose: To analyze the direction and strength of relationship between two continuous
variables.
Usually, people correlate things to understand them better, like height and weight, sales
and advertisement cost, price and demand, etc.
In general, correlation helps us in understanding the direction and strength of relationship
between any two continuous variables. The underlying assumption is that these two
variables must be normally distributed. If one of the two is not normally distributed, proceed
for Spearman’s Rank Correlation.
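The fallback uses the same functions with method = 'spearman'; a minimal sketch on made-up values (not from any dataset in this unit):

```r
# Sketch: Spearman's rank correlation when normality fails for a variable.
x <- c(2, 4, 6, 8, 10, 12, 14)        # illustrative values
y <- c(1, 3, 2, 5, 4, 7, 6)
cor(x, y, method = "spearman")        # rank-based correlation coefficient
cor.test(x, y, method = "spearman")   # with a significance test
```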

The correlation(r) value varies from -1 to +1 where -1 indicates perfectly negative correlation,
+1 indicates perfectly positive correlation and 0 indicates no correlation. The following are
the levels of correlation to be followed in interpreting the results.
If r is between 0 and 0.3, there is a weak positive correlation;
if r is between 0.3 and 0.7, there is a moderate positive correlation;
if r is between 0.7 and 1, there is a strong positive correlation.
The same ranges are applicable in the negative direction too.

For example: if r = +0.73 → strong positive correlation
r = -0.32 → moderate negative correlation
r = 0 → no correlation
Package: stats,psych
Function: cor(),cor.test(),pairs.panels()
Example:
> # karl-pearson's correlation
> dt<-read.csv(file.choose())
> str(dt)
'data.frame': 24 obs. of 5 variables:
$ Sales : int 12 14 15 17 18 18 20 13 17 19 ...
$ Region : chr "Guntur" "Guntur" "Guntur" "Guntur" ...
$ Brands : chr "BrandA" "BrandA" "BrandA" "BrandA" ...
$ Profits: int 8 10 11 12 12 12 13 9 9 15 ...
$ YoE : int 5 5 6 6 8 8 8 7 5 6 ...
> # Objective is to analyze the correlation between Sales and Years of Establishment (YoE)
of a firm.
> attach(dt)
The following objects are masked from dt (pos = 7):

Brands, Profits, Region, Sales, YoE

> # Assumption: Normality for two variables.


> shapiro.test(Sales)

Shapiro-Wilk normality test

data: Sales
W = 0.92128, p-value = 0.06236

> shapiro.test(YoE)

Shapiro-Wilk normality test

data: YoE
W = 0.95219, p-value = 0.302
From the above results, it is evident that both variables are normally distributed.

> cor(Sales,YoE,method='pearson')
[1] 0.6242311
> # There is a positive moderate correlation between Sales and YoE.
> # As the above result is not tested its significances statistically,
> # let us perform the statistical significant test
> # NH: There is no correlation between Sales and YoE
> # AH: There is a correlation between Sales and YoE
> cor.test(Sales,YoE,method='pearson')

Pearson's product-moment correlation

data: Sales and YoE


t = 3.7478, df = 22, p-value = 0.001114
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.2951593 0.8209118
sample estimates:
cor
0.6242311

> # The above p-value is 0.001114 < 0.05, which rejects NH.
> # This indicates that there is a statistically significant correlation between Sales and YoE.
> # Moreover, the correlation value of 0.6242311 indicates a moderate positive correlation
> # between the two variables.
Note: As there is a significant correlation between the two variables, one can proceed for regression.
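The Function line above lists pairs.panels() from the psych package, which the example does not demonstrate. A hedged sketch, using a small illustrative stand-in built from the first eight rows of the variables shown in str(dt):

```r
# Sketch: correlation matrix plus panel plot of all continuous variables.
# A small stand-in for the imported dataset (first 8 rows of str(dt) above).
dt <- data.frame(
  Sales   = c(12, 14, 15, 17, 18, 18, 20, 13),
  Profits = c(8, 10, 11, 12, 12, 12, 13, 9),
  YoE     = c(5, 5, 6, 6, 8, 8, 8, 7)
)
cor(dt)   # pairwise Pearson correlations in one matrix
if (requireNamespace("psych", quietly = TRUE)) {
  psych::pairs.panels(dt)   # scatter plots, histograms and r values together
}
```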

Summary

The present session helps in understanding the usage of ordinal and scale tests along with
their respective post-hoc tests based on normality. When the data is not normal or is ordinal,
ordinal tests are used, and when the data is normal, scale tests are applied. As a part of
ordinal tests, the Wilcoxon signed rank test is used for analyzing one sample group or paired
sample groups. Further, the Mann-Whitney U test is for two sample groups, the Kruskal-Wallis
test for more than two sample groups, and the Friedman test for repeated responses. As a part
of scale tests, the one sample t-test is for one sample group, the two sample t-test for two
sample groups, one-way ANOVA for more than two sample groups, the paired sample t-test for
paired responses and repeated measures ANOVA for repeated responses. Lastly, to find out the
direction and strength of the relationship between two variables, Karl Pearson's coefficient of
correlation is used when the data is normal and Spearman's rank correlation is used when the
data is not normal or is ordinal.

Self-Assessment Questions (SAQ)


1. Which of the following test is used for one sample group when the data is ordinal?
a. Mann-Whitney U test
b. Wilcoxon Signed Rank test
c. Kruskal-wallis test
d. None
2. Which of the following test is used for two sample groups when the data is ordinal?
a. Mann-Whitney U test
b. Wilcoxon Signed Rank test
c. Kruskal-wallis test
d. None
3. Which of the following test is used for more than 2 sample groups when the data is
ordinal?
a. Mann-Whitney U test
b. Wilcoxon Signed Rank test
c. Kruskal-wallis test
d. None
4. Which of the following test is used for paired response when the data is ordinal?
a. Mann-Whitney U test
b. Wilcoxon Signed Rank test
c. Kruskal-wallis test
d. None
5. Which of the following test is used for correlation when the data is ordinal?
a. Mann-Whitney U test
b. Wilcoxon Signed Rank test
c. Kruskal-wallis test
d. None
6. Which of the following test is used for correlation when the data is ordinal?
a. Mann-Whitney U test
b. Wilcoxon Signed Rank test
c. Kruskal-wallis test
d. Spearman’s Rank correlation
7. Which of the following test is used for correlation when the data is scale?
a. Mann-Whitney U test
b. Karl Pearson’s Coefficient of Correlation
c. Kruskal-wallis test
d. Wilcoxon Signed Rank test
8. Which of the following test is suitable to find the correlation when data is ordinal with ties?
a. Kendall’s tau a
b. Karl Pearson’s Coefficient of Correlation
c. Kendall’s tau b
d. Spearman’s Rank correlation
9. Identify the test suitable for testing the normality of data.
a. Bartlett’s test
b. Levene’s test
c. Shapiro-Wilk test
d. None
10. Identify the default two sample t-test when variances are unequal.
a. Two sample t-test
b. Paired Sample t-test
c. Welch’s two sample t-test
d. None

Terminal Questions

List out the tests meant for ordinal data


List out the tests meant for scale data.
Explain the significance of One way ANOVA with an example.
Illustrate an example for Repeated measures ANOVA.
Illustrate an example for Karl Pearson’s correlation and Spearman’s rank correlation.

Answer Keys (for SAQ)


1 2 3 4 5 6 7 8 9 10
B A C B D D B C C C

Activity

Create a data set and work on different ordinal tests.


Create a dataset and work on different scale tests.

Glossary
Ordinal – data expressed in order or ranking
Scale – data expressed as continuous
Normality – expresses that the data is symmetric on both sides
Homogeneity – expresses about the variances equality
Post-hoc test – helps in knowing the significant differences across the categories

Bibliography
Davis. (2022). Statistical Testing with R. Vor Publications.

Kloke, J., & McKean, J. W. (2015). Non Parametric Statistical methods Using R. Taylor & Francis.

Mehemetoglu, M., & Mittner, M. (2021). Applied Statistics Using R:A Guide for Social Sciences. Sage
Publications.

Sharma, S. (2018). Nursing Research and Statistics. Elsevier Health Services.

e-References
https://datatab.net/tutorial/friedman-test
https://datatab.net/tutorial/kendalls-tau

Video links
https://www.youtube.com/watch?v=iF8nHwLzlxg
https://www.youtube.com/watch?v=dwGzs1D4nyk

Image Credits - NA

Keywords

Ordinal, Interval, Ratio, Scale, post-hoc test, normality, homogeneity, variance, correlation


Data Visualization in R

MODULE – IV

Data Visualization in R
Module 4
Module Description
In this module, the learning starts with understanding the significance of data visualization
in the present world. It has become a very challenging area for career development too. Data
visualization is the process of presenting data in different visual forms to understand the
data better. Usually, data visualization is also referred to as presentation of data. This
presentation of data is done in two ways, namely Graphical Presentation of Data and
Diagrammatic Presentation of Data. In Graphical Presentation of Data, continuous variables
are plotted, whereas categorical or discrete data is plotted in Diagrammatic Presentation of
Data. Under Graphical Presentation of Data, the module covers scatter plots, histograms,
frequency polygons, frequency curves and ogives. In Diagrammatic Presentation of Data, the
module covers one-dimensional diagrams using bar charts, two-dimensional diagrams using
pie charts, three-dimensional diagrams using cubes and cones, pictograms and cartograms.
The module introduces a new GUI in R, i.e., R Commander (Rcmdr), for producing these plots
along with the usage of the basic console window.
Aim
To make us understand different ways of presenting the data using R console and Rcmdr.
Instructional Objectives
This module includes:
 Data visualization tools
 Showcasing data graphically
 Showcasing data diagrammatically
Learning Outcomes
 Able to understand the significance of different data visualization tools
 Able to apply graphical presentation to data
 Able to apply diagrammatic presentation to data
Unit 4.1 Introduction to Data Visualization
In this module, the basic understanding of data visualization and its types is discussed.
Further, it emphasizes the different data visualization tools available in the market. Later, it
also talks about four types of graphical presentation of data, namely scatter plots, histograms,
line graphs and box plots. Further, the module also covers bar charts and pie charts under
diagrammatic presentation of data.

IV.1.1 Definition & Types


Data Visualization is a form of presenting the data in a visual form.

The basic purpose of doing so is to make everyone understand your findings.
Your boss may not always be tech savvy, so the visualization should be
understandable even to a non-statistician. The second major purpose of data
visualization is to project a large amount of data in a smart, compact way.
There are two major types :
1. Graphical presentation of data (Continuous data)
2. Diagrammatic presentation of data (Categorical data)
IV.1.2 Data visualization tools
There are several tools available for Data visualization in the market. Some of the
tools are as follows;
1. Excel
2. Tableau
3. Power BI
4. Qlik Sense
5. Google Data Studio
6. Grafana(Open-Source)
7. Python –matplotlib,seaborn (Open-Source)
8. R –graphics,ggplot2,lattice,plotly,Rcmdr (Open Source)
IV.1.3 Introduction to R Commander
R Commander (Hutcheson, 2019) is a Graphical User Interface (GUI) in R created by John
Fox, a statistics professor, that helps in performing data analysis as well as visualization.
It can be loaded using the package named ‘Rcmdr’.
1. Install the Rcmdr package using install.packages(“Rcmdr”, dependencies = T)
2. Load the package using the command library(Rcmdr)
After successful loading, the outcome looks as follows;

Screenshot 1: View of Rcmdr window after loading

In the above Screenshot 1, you will see a screen with two windows. The upper window
showcases the R Script or R Markdown, and the lower one is the output window.

The Rcmdr helps in performing all the statistical analysis using parametric and non-
parametric tests under Statistics tab and all graphs under Graphs tab in the menu bar
options. The following screenshot 2 provides an idea of parametric tests and Screenshot 3
provides an idea of non-parametric tests as shown below.
Screenshot 2: Path to parametric tests

Screenshot 3: Path to Non-parametric tests


Note: Parametric and Non-parametric tests will become active when you import a dataset.
Now, the most relevant tab is ‘Graphs’ that helps in plotting different graphs like scatter plot,
histogram, barcharts, and piecharts.
Let us have a dataset with four variables namely Gender, Sections, Marks and Attendance
as shown below with 20 observations.
Gender Sections Marks Attendance
Male SecA 18 85
Male SecB 17 75
Male SecC 14 65
Male SecA 16 70
Male SecB 15 68
Male SecC 18 84
Male SecA 19 89
Male SecB 20 95
Male SecA 17 74
Male SecB 16 65
Female SecC 12 60
Female SecA 11 55
Female SecB 10 60
Female SecC 17 76
Female SecA 12 65
Female SecB 13 66
Female SecA 14 70
Female SecB 19 85
Female SecA 19 90
Female SecB 20 96
Table 1: sample dataset
Let us import this dataset to work with graphs as shown below. The path goes as follows.
Path: Data → Import data → from text file, clipboard…

Screenshot 4: path to import a dataset.


Once you click on ‘from text file, clipboard…’, it opens the following window.
Screenshot 6:Basic window with default settings
In the above basic window, change the default data set name from ‘Dataset’ to ‘dt127’ and
select ‘Comma’ radio button under Field Separator as shown below in Screenshot 7.

Screenshot 7: Window with modified settings in name and Field Separator


After selecting the options as shown in Screenshot 7, click on ‘OK’ to select or browse the
file from the computer as shown below.
Screenshot 8: Selecting the file from the computer
After selecting the file name, i.e., data127 (the file name given here), click on ‘Open’ to bring
the data file into Rcmdr.
After clicking on ‘Open’, the file moves to Rcmdr and is shown as follows.

Screenshot 9: Imported dataset view


You can see the data file name in the dataset box, beside the ‘Edit dataset’ button.
Once the data file name is visible in the box, click on ‘Edit dataset’ to open another window
for editing, if needed, as shown in Screenshot 9. If the data is ok, click on ‘OK’ to close the
editing window.
Then click on the ‘View dataset’ option to view the confirmed dataset as shown in Screenshot
10.
Screenshot 10: Dataset view
If the data is ok, just close the viewing window and move back to the Rcmdr window to
obtain the different graphs.

Unit 4.2 Graphical Presentation of Data

In this type of presentation, the continuous data is presented in several forms. They are,
i. Scatter plot - Graphical presentation of correlation
ii. Histogram – A Continuous Barchart.
iii. Frequency Polygon – Connecting the middle points of histogram with straight
lines
iv. Frequency Curve – Smoothening of the Frequency polygon.
v. Ogives (Cumulative Frequency Curve)
Other plots like line graphs and box plots are also included here. Currently, the scatter plot,
histogram, line graphs and box plots are discussed.
IV.2.1 Scatter plot
This plot is to showcase the relationship between two continuous variables In
general, it is also referred as a graphical plot of correlation.
In the present context w.r.t our dataset named ‘data127’, 2 varaibles are available
namely ‘Marks’ and ‘Attendance’. Being continuous in nature, correlation is possible.
To view this correlation, let us plot the scatter plot using Rcmdr as follows;
Path: Graphs → Scatter plot
Screenshot 11: Scatter plot
Now, after selecting ‘Scatter plot’, choose ‘Attendance’ in x-variable and ‘Marks’
under y-variable as shown in the Screenshot 11. After selecting the variables, click
on ‘Apply’ to obtain the plot and click ‘OK’ to close the window to see the plot. The
outcome is shown in Screenshot 12.

Screenshot 11: Selecting variables for scatter plot


Screenshot 12: Outcome of scatter plot
From Screenshot 12, it is evident that the movement of Attendance and Marks
exhibits a positive correlation. If you observe carefully, as the Attendance
increases, the Marks increase too, with the candidates moving towards the top
right-hand corner.
Finally, from this scatter plot, the study concludes that there is a positive correlation
between Marks and Attendance.
You can even confirm this result by obtaining the correlation value as shown
below in Screenshot 12.

Screenshot 12: Correlation value


Here, the correlation value of Marks and Attendance is 0.93, a strong positive
correlation.
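The same plot can be produced from the R console without Rcmdr; a sketch using the first ten rows of Table 1:

```r
# Sketch: base-R equivalent of the Rcmdr scatter plot above.
Attendance <- c(85, 75, 65, 70, 68, 84, 89, 95, 74, 65)  # first 10 rows of Table 1
Marks      <- c(18, 17, 14, 16, 15, 18, 19, 20, 17, 16)
plot(Attendance, Marks, pch = 16, col = "blue",
     main = "Marks vs Attendance")
abline(lm(Marks ~ Attendance))   # add a trend line
cor(Attendance, Marks)           # numeric confirmation of the direction
```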
IV.2.2 Histogram
This graph helps in understanding the underlying distribution of the data, i.e., whether
the data is normally distributed or skewed. Usually, the histogram is drawn for
continuous variables and is also referred to as a continuous bar chart. In the present
context, the dataset has two continuous variables namely Marks and Attendance. For
these two variables, histogram is possible. Let us plot histogram for Marks variable.
Path: Graphs → Histogram

Screenshot 13: Path to Histogram

Screenshot 14: Selecting a Variable (i.e., Marks)


Now, after selecting ‘Marks’ variable, click on Apply to get the result and click
‘OK’ to close the window. The result is as shown below in Screenshot 15.
Screenshot 15: Histogram
If you want the labels on x-axis and y-axis along with title, click on ‘options’ tab in
histogram and provide the details to get the following result.

Screenshot 16: Histogram with labels and title


From the result in Screenshot 16, it is evident that the histogram is neither completely
positively skewed nor negatively skewed; to some extent, it is normally distributed.
Confirm the normality using the ‘Test of normality’ option under the ‘Statistics’ menu
to get the normality result shown in Screenshot 17.
Screenshot 17: Normality result for Histogram (as the Shapiro-Wilk p-value is 0.2406
> 0.05, the data is normal)
Let us produce histograms group-wise. In this context, histograms can be produced
gender wise as well as section-wise as shown in the Screenshot 18 and 19
respectively.

Screenshot 18: Gender-wise Histograms


Screenshot 19: Section-wise Histogram
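The same histogram can be drawn from the R console; a sketch with the full Marks column of Table 1 typed in:

```r
# Sketch: base-R equivalent of the Rcmdr histogram for Marks.
Marks <- c(18, 17, 14, 16, 15, 18, 19, 20, 17, 16,
           12, 11, 10, 17, 12, 13, 14, 19, 19, 20)   # Table 1
hist(Marks, col = "steelblue",
     xlab = "Marks", ylab = "Frequency",
     main = "Distribution of Marks")
shapiro.test(Marks)   # confirm normality, as in Screenshot 17
```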
IV.2.3 Line graphs
The line graphs are drawn by connecting the data points with lines. Usually, the
growth or decline of performance is projected in line graphs.
Example:
Compare the performance of two companies in terms of profits having data from
2012 to 2022 as follows.
Year 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
C1   8    12   11   14   16   15   18   19   20   19   22
C2   10   14   15   17   16   18   19   21   22   24   23
Table 2: Performance of Company 1 and Company 2
There are two ways of creating line graphs for this comparison:
i. Using Rcmdr
ii. Using R Console
i. Using Rcmdr
In order to draw the line graphs, first save the above Table 2 data in a csv file as
data128.csv (the name used here), arranging the data as shown below;
Year Company 1 Company 2
2012 8 10
2013 12 14
2014 11 15
2015 14 17
2016 16 16
2017 15 18
2018 18 19
2019 19 21
2020 20 22
2021 19 24
2022 22 23
Table 3: Formatted table
Now, import the csv file into Rcmdr as discussed in the previous exercises. After importing
the file, confirm it using Edit dataset and View dataset. Once you can view the imported
dataset, proceed to drawing the line graphs. The path for line graphs in Rcmdr is:
Path: Graphs → Line graph (click on this).
After clicking on Line graph, it will exhibit a window with two slots where you need to select
the x variable and the y variables. As the x variable, pick the ‘Year’ variable, and as the y
variable(s), pick the Company 1 and Company 2 variables as shown below;

Screenshot 20: Selecting Variables for line graph


After selecting the variables, click on ‘Apply’ tab and click ‘OK’ to close the current window
and to view the output. The output appears as follows in Screenshot 21.

Screenshot 21: Comparison of performance of two companies- Company.1 and Company.2


The plot shows that both companies have gone through phases of ups and downs, reaching
a point where Company.1 has started growing and Company.2 has started declining. The
companies have to investigate why this is happening in order to take a decision.

ii. Using R Console

> # Compare the performance of two companies using line graphs


> # First, import the data
> dt<-read.csv(file.choose())
> str(dt)
'data.frame': 11 obs. of 3 variables:
$ Year : int 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 ...
$ Company.1: int 8 12 11 14 16 15 18 19 20 19 ...
$ Company.2: int 10 14 15 17 16 18 19 21 22 24 ...
> dt
Year Company.1 Company.2
1 2012 8 10
2 2013 12 14
3 2014 11 15
4 2015 14 17
5 2016 16 16
6 2017 15 18
7 2018 18 19
8 2019 19 21
9 2020 20 22
10 2021 19 24
11 2022 22 23
> # After importing the dataset, use following functions to create the plots.
> # plot() for first line graph
> # lines() to add second line to the existing plot
> # legend() to add legend to the plot
> # Note: always draw the first plot so that its range covers both datasets.
> # At times, it is better to add the 'ylim' argument to cover the full range of plot values.
> # let us start
> attach(dt)
> # Draw the first line graph
>plot(x=Year,y=Company.1,type='b',lty=2,lwd=2,pch=16,col='blue',ylim=c(5,25),
xlim=c(2012,2022),ylab='Performance of two companies in Profits',main='Comparative
Analysis')
> # The decision regarding ylim is yours, based on the range of your entire dataset.
> # In this dataset, the min value is 8 (in Company.1) and the max value is 24 (in Company.2),
> # so pick the y limits so that this range is covered. The study has taken (5,25); it could
> # be (6,24) also. It is at your discretion, but remember the data range must be covered.
> # Add second line to the plot
> lines(x=Year,y=Company.2,type='b',lty=2,lwd=2,pch=16,col='red')
> # Add legend to the plot.
> legend('topleft',legend=c('Company.1','Company.2'),col=c('blue','red'),lty=c(2,2),lwd=c(2,2))
Now, the plot looks as shown below in Fig.1

Fig:1 Performance of two companies

Therefore, this is how a line graph is drawn.

IV.2.4 Box plots

In this plot, the reader can understand the distribution of the data and, most importantly,
identify outliers in the data variable. Usually, it is referred to as a Box and Whisker
plot. Mainly, it provides five values for the identified range in the data, namely the
Min value, Quartile 1 (Q1), Quartile 2 (Q2, or the median), Quartile 3 (Q3) and the Max value.

Box plots can be drawn for two cases:

i. Single variable (one continuous variable)
ii. Multiple variables or a numeric dataframe (multiple continuous variables)

Generally, box plots are drawn for continuous variables.
i. Single variable (one continuous variable)

These box plots can be drawn using either Rcmdr or the R console.
Let us draw a box plot for Attendance using Rcmdr, as shown below.
Path: Graphs → Boxplot…
Fig: 2 Box plot of Attendance Fig: 3 Boxplot of Marks

From the above two box plots in Fig. 2 and Fig. 3, the distribution seems to be positively
skewed for Attendance and slightly negatively skewed for Marks. Further, it can be concluded
that there are no outliers, as no circles appear outside the whiskers.
Instead of Rcmdr, if you perform the same analysis after importing the same dataset into the
R console, you will get additional outcomes too.
In R console perform the following to get Box plot as shown in Screenshot 22.

Screenshot 22: Outcome of Box plot for Attendance variable.


In the above Screenshot 22, on the right-hand side, it is clear that the tool identified the range
of the Attendance data: 55 is the min value, 65 is Q1, 72 is Q2, 85 is Q3 and 96 is the max
value. The $out component showing numeric(0) indicates that there are no outliers in the
data variable.

Screenshot 23: Boxplot of Attendance


Similarly, you can try the same with Marks to get a better understanding of the concept.
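The console commands behind Screenshot 22 can be sketched as follows, with the Attendance column of Table 1 typed in; boxplot() returns the five-number summary in $stats and any outliers in $out:

```r
# Sketch: console box plot of Attendance with its summary components.
Attendance <- c(85, 75, 65, 70, 68, 84, 89, 95, 74, 65,
                60, 55, 60, 76, 65, 66, 70, 85, 90, 96)   # Table 1
b <- boxplot(Attendance, main = "Box plot of Attendance")
b$stats   # min, Q1, median (Q2), Q3, max -> 55, 65, 72, 85, 96
b$out     # outliers; numeric(0) means none
```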

ii. Multiple variables or numeric dataframe (Multiple Continuous variables)


In this scenario, box plots are drawn for more than one variable at a time; you can directly
use a dataframe of variables. Let us draw box plots for both Attendance and Marks at a time.
Use the same path for obtaining multiple box plots as follows;
Path: Graphs → Boxplot…
Now, in the variable selection window of the box plot, choose both Attendance and Marks.

Fig: 3 Two box plots in one slide without outliers


If you observe Fig. 3, the two box plots are plotted with different scale limits as they have
different ranges. It is also observed that there are no outliers in the two variables, as
numeric(0) is returned for both Attendance and Marks.
Let us see a scenario with outliers using the mtcars dataset.

Screenshot 24: mtcars wt variable with 2 outliers (circles outside the upper whisker)
It is clearly evident that there are 2 outliers in the variable, namely the values 5.345 and 5.424.
Once outliers are identified, you either need to remove them or replace them with the mean
score, a kind of imputation.
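A sketch of that workflow on mtcars: boxplot.stats() lists the values beyond the whiskers, which are then replaced with the mean of the remaining values (a simple mean imputation).

```r
# Sketch: identify box-plot outliers in mtcars$wt and mean-impute them.
wt   <- mtcars$wt
outs <- boxplot.stats(wt)$out            # values beyond the whiskers
outs                                     # the outlying weights
wt_clean <- wt
wt_clean[wt %in% outs] <- mean(wt[!wt %in% outs])   # replace with the mean
summary(wt_clean)                        # distribution after imputation
```

Whether to remove or impute is a judgment call; imputation keeps the sample size intact but shrinks the variance.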
In the basic R console, use the boxplot(dataframe) syntax to plot multiple box plots in one go.
Do remember that the variables should be continuous in nature.

Unit 4.3 Diagrammatic Presentation of Data

In this presentation, categorical or discrete data is presented. As we know, the diagrammatic
presentation of data (Sharma, Nursing Research and Statistics-E-Book, 2022) includes
one-dimensional diagrams like bar charts, two-dimensional diagrams like pie charts,
three-dimensional diagrams like cubes or cylinders, pictograms (data presented in pictures)
and cartograms (data presented on maps). The study is limited to the most used ones, i.e.,
bar charts and pie charts only.

4.3.1 Barcharts

Bar charts are drawn for categorical or discrete data. Categorical data means, for example,
the number of males and females in a class, or the number of students in the class coming
from different regions. Discrete data means, for example, the number of cars passing a toll
gate, the number of patients visiting a clinic, or the number of tourists visiting a place.
These barcharts are of three types. They are as follows;
1. Simple Barchart
2. Sub-divided Barchart (Or Stacked Barchart)
3. Multiple Barchart (Or Grouped Or Side-by-Side Or Parallel)
Let us obtain them from Rcmdr using the same dataset used for the scatter plot.
1. Simple Barchart
Let us draw a simple bar chart for Gender and Sections, as shown in Screenshot 25.
Path: Graphs Bar graph…

Screenshot 25: Path of Barchart

After clicking on ‘Bar graph…’, you will get a window with two tabs, namely the ‘Data’ tab and
the ‘Options’ tab. Select ‘Gender’ in the Data tab.

Screenshot 26: Selecting variable for Bar graph in Data tab


Screenshot 27: Options tab in Bar graph (Provide x label and y labels)

After selecting variable in Data tab and adding x and y labels in Options tab, click on ‘Apply’
to get the result and click ‘OK’ to close the Bar graph window. The result is as follows;

Fig: 5 Barchart or Bar graph of Gender

Task: Perform the same with Sections at your end.
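The same simple bar chart can also be produced in the basic R console with barplot() applied to a frequency table; a sketch with hypothetical Gender values:

```r
# Hypothetical Gender data; table() computes the frequencies to plot
gender <- factor(c("Male", "Female", "Male", "Male", "Female", "Female", "Male"))
counts <- table(gender)               # Female: 3, Male: 4

barplot(counts, xlab = "Gender", ylab = "Frequency",
        main = "Bar chart of Gender", col = c("orchid", "steelblue"))
```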


2. Sub-divided Barchart (Or Stacked Barchart)
In this chart, the data of two categorical variables can be assessed. In the present example,
there are two variables, namely Gender and Sections. Here, two plots are possible, i.e.,
Gender by Section and Section by Gender.
Path: Graphs Bar graph…
i. Gender by Section – the Section-wise Gender distribution is presented.
ii. Section by Gender – the Gender-wise Section distribution is presented.

The above two possibilities are plotted below in Screenshot 28 and Screenshot 29.

Screenshot 28: Gender by Section-Selection

Screenshot 29: Gender by Section-Outcome


Screenshot 30: Section by Gender –Selection

Screenshot 31: Section by Gender -Outcome
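In the basic R console, the equivalent stacked bar chart comes from barplot() applied to a two-way table; stacking is the default (beside = FALSE). A sketch with hypothetical Gender and Section values:

```r
# Hypothetical data for two categorical variables
gender  <- c("Male", "Female", "Male", "Female", "Male", "Female", "Female", "Male")
section <- c("A", "A", "A", "B", "B", "B", "A", "B")
tab <- table(gender, section)         # rows = Gender, columns = Section

# Each Section bar is sub-divided (stacked) by Gender
barplot(tab, beside = FALSE, legend.text = rownames(tab),
        xlab = "Section", ylab = "Frequency",
        main = "Gender by Section (stacked)")
```

Swapping the table arguments, table(section, gender), gives the Section-by-Gender view instead.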

3. Multiple Barchart (Or Grouped Or Side-by-Side Or Parallel)

In this chart, the data projected in bars is grouped together based on the categorical
variables. Here also, there are two ways of projecting the data in bars.

Before obtaining multiple bar graphs, you need to change the default setting from the Divided
(or stacked) option to the Side-by-Side option, as shown below.
Screenshot 32: Changing stacked to side-by-side option in Options tab

i. Section by Gender

Screenshot 33: Sections by Gender –Selection


Screenshot 34: Sections by Gender-Outcome

ii. Gender by Section

Screenshot 35: Gender by Section- Selection


Screenshot 36: Gender by Section – Outcome
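In the basic R console, the same two-way table switches from stacked to grouped bars with the single argument beside = TRUE; a sketch reusing hypothetical Gender and Section values:

```r
# Hypothetical data for two categorical variables
gender  <- c("Male", "Female", "Male", "Female", "Male", "Female", "Female", "Male")
section <- c("A", "A", "A", "B", "B", "B", "A", "B")
tab <- table(gender, section)

# beside = TRUE places the Gender bars side by side within each Section
mids <- barplot(tab, beside = TRUE, legend.text = rownames(tab),
                xlab = "Section", ylab = "Frequency",
                main = "Gender by Section (grouped)")
```

Note that barplot() invisibly returns the bar midpoints (captured here as mids), which can be reused, for example, to place text labels above the bars.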

4.3.2 Piechart -2D

The pie chart is used to project a proportion or percentage distribution. For example, plotting
the market share of the Top 5 ERP vendors in 2023.
> # A two-dimensional diagram
> # Pie-chart 2D
> # Using market-share data of 2023, according to Software Connect
> ms <- c(24.6, 21, 15.1, 9.4, 5.3, 24.6)
> lbs <- c("Microsoft - 24.6%", "SAP-AG - 21%", "Oracle - 15.1%",
+          "Sage - 9.4%", "Infor - 5.3%", "Others - 24.6%")
> pie(ms, labels = lbs, clockwise = TRUE, col = rainbow(6),
+     main = "Market share of Top 5 ERP vendors")
> legend("topright", legend = lbs, fill = rainbow(6))
Screenshot 37: Pie-chart - 2D of ERP vendors with market share

4.3.3 Piechart -3D

In order to plot a pie chart in 3D, the plotrix package is used. After installing and loading the
package, the pie3D() function is used. To get a better outcome, the explode argument is used.

> library(plotrix)
> ms <- c(24.6, 21, 15.1, 9.4, 5.3, 24.6)
> lbs <- c("Microsoft - 24.6%", "SAP-AG - 21%", "Oracle - 15.1%",
+          "Sage - 9.4%", "Infor - 5.3%", "Others - 24.6%")
> # In the 3D pie chart, there is no clockwise argument.
> pie3D(ms, labels = lbs, col = rainbow(6), main = "Market Share of Top 5 ERP vendors")
> # With the explode argument, the slices are pulled slightly apart:
> pie3D(ms, labels = lbs, col = rainbow(6), explode = 0.1,
+       main = "Market Share of Top 5 ERP vendors")
The outcome is as follows;
Fig: 4 Market Share of Top 5 ERP vendors
Fig: 5 Market share of Top 5 ERP vendors with Explode Argument

Self –Assessment Questions


1. How many types of data presentation are discussed in the current module?
a. 3 b. 2 c. 4 d. 5
2. Which of the following comes under graphical presentation of data?
a. Scatter plot
b. Barchart
c. Frequency Curve
d. a & c
3. Which of the following comes under diagrammatic presentation of data?
a. Scatter plot
b. Barchart
c. Frequency Curve
d. a & c
4. Which of the following plots is used for presenting proportions?
a. barchart
b. piechart
c. scatter plot
d. none
5. Basically, how many types of barcharts are discussed in the module?
a. 2 b. 3 c.4 d.1
6. What is the other name of stacked barchart?
a. multiple barchart
b. grouped barchart
c. subdivided barchart
d. side-by-side barchart
7. What are the other names of multiple barchart?
a. grouped barchart
b. Side-by-Side barchart
c. Parallel barchart
d. All of the above
8. Which package is used for plotting pie3D?
a. plotrix b.graphics c.stats d.none
9. Which of the following are packages meant for data visualization in python?
a. plotly b.matplotlib c. seaborn d. b&c
10. Which is the top data visualization tool in the world as on date?
a.Tableau b. Power BI c.QlikSense d. Google DataStudio

Summary

The present module discusses data visualization, i.e., presenting data in a visual form, as an
important area not only for the statistician but also for the non-statistician to better understand
the current situation of a firm or a business. As a part of data visualization, two types of
presentation of data are covered, namely graphical presentation of data and diagrammatic
presentation of data. Under graphical presentation of data, scatter plots, histograms, line
graphs and box plots are discussed; under diagrammatic presentation of data, bar charts and
their types, along with pie charts in 2D and 3D, are discussed using R Commander (Rcmdr),
a GUI for R. Overall, other types of graphics, namely grid graphics, lattice graphics and the
like, can be used based on the requirement.

Terminal Questions

1. Describe the significance of data visualization.


2. Elaborate the types of graphical presentation of data.
3. Elaborate the types of Diagrammatic presentation of data.
4. Discuss the significance of data visualization tools in the market.

Answer Keys
Q: 1 2 3 4 5 6 7 8 9 10
A: B D B B B C D A D A

Activity
Create a sample data set to apply graphical presentation of data
Create a sample dataset to apply diagrammatic presentation of data

Glossary

Data visualization – presenting the data in a visual form


Graph – Presenting data in the form of lines or dots
Chart – Presenting data in the form of bars and columns
Histogram – A continuous bar chart meant for continuous variables.
Barchart – A chart meant for categorical or discrete data
Piechart – Chart meant for presenting proportions or percentages

Bibliography
Hutcheson, G. (2019). Data Analysis Using R Commander: An Introduction to R Commander. Sage
Publications.

Mehemetoglu, M., & Mittner, M. (2021). Applied Statistics Using R:A Guide for Social Sciences. Sage
Publications.

Sharma, S. (2022). Nursing Research and Statistics-E-Book. Elsevier Health Sciences.

e-References

https://www.edureka.co/blog/tutorial-on-importing-data-in-r-commander/
https://www.javatpoint.com/r-data-visualization
https://ladal.edu.au/dviz.html

Video links
https://www.youtube.com/watch?v=MAWY51fI01o
https://www.youtube.com/watch?v=C9_zac1LQ9o
https://www.youtube.com/watch?v=DLbu8HdywAk

Image Credits -NA

Keywords
 Visualization
 Scatter plot
 Histogram
 Barchart
 Piechart
 Line graph
 Boxplot
