Compute Summary Statistics In R
Last Updated: 02 Aug, 2024
Summary statistics provide a concise overview of a dataset's characteristics, offering insight into its central tendency, dispersion, and distribution. The R programming language, together with its many packages, offers several ways to compute summary statistics efficiently. The main techniques covered here are:
1. Using Base R Functions
- Base R provides several built-in functions for computing summary statistics, including summary(), mean(), median(), min(), max(), quantile(), sd(), and var().
- These functions offer basic summary statistics such as minimum, maximum, median, mean, standard deviation, and variance for a given dataset.
2. Using External Packages
- R offers various external packages that extend the functionality of base R for computing summary statistics.
- The psych package provides the describe() function, which gives a fuller summary of each variable, including measures such as the trimmed mean, skewness, and kurtosis. The interquartile range (IQR) is reported as well when the IQR = TRUE argument is set.
- Packages like dplyr and data.table offer functions for computing summary statistics for grouped data and performing complex data manipulation tasks efficiently.
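As a rough illustration of the data.table approach mentioned in the last bullet, the sketch below computes grouped statistics on the built-in mtcars dataset. It assumes the data.table package is already installed; the object names dt and by_cyl are chosen only for this example.

```r
library(data.table)

# Work on a data.table copy of the built-in mtcars data
dt <- as.data.table(mtcars)

# Mean, median, standard deviation and row count of mpg per cylinder count;
# keyby = cyl groups by cyl and sorts the result by it
by_cyl <- dt[, .(mean_mpg   = mean(mpg),
                 median_mpg = median(mpg),
                 sd_mpg     = sd(mpg),
                 n          = .N),
             keyby = cyl]
print(by_cyl)
```

The result has one row per value of cyl, which makes it comparable to a dplyr group_by() and summarise() pipeline on the same data.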
3. Grouping Data for Summary Statistics
- Grouping data allows us to compute summary statistics for subsets of the dataset based on one or more grouping variables.
- We can use the group_by() function from the dplyr package to group data by one or multiple variables and then compute summary statistics for each group using functions like summarise().
4. Summarising Multiple Variables
- Sometimes, we may want to summarise multiple variables simultaneously.
- The summarise() function from the dplyr package allows us to compute summary statistics for multiple variables at once, such as mean, median, standard deviation, etc.
5. Additional Statistical Summary Functions
- R offers additional functions for computing useful summary statistics beyond the basic measures provided by base R functions.
- The skewness() and kurtosis() functions (available in add-on packages such as e1071) and base R's IQR() function compute skewness, kurtosis, and the interquartile range (IQR), respectively, giving a closer look at the shape and spread of the distribution.
Compute Summary Statistics In R
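Step 1: Install and Load the Required Packages
R
# Install the packages once (skip this line if they are already installed)
install.packages(c("dplyr", "data.table", "e1071"))

# Load the packages for the current session
library(dplyr)       # group_by(), summarise(), %>%
library(data.table)  # fast grouped operations on large tables
library(e1071)       # skewness(), kurtosis()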
Step 2: Load the Dataset
R
# Load the mtcars dataset
data(mtcars)
Step 3: Summary Statistics of Ungrouped Data
First, compute summary statistics for a single variable (mpg) across the whole dataset, without any grouping, using base R functions such as summary(), mean(), and median().
R
# Summary statistics for ungrouped data
cat("Summary statistics for mpg variable:\n")
summary(mtcars$mpg)
cat("\nMean of mpg:", mean(mtcars$mpg), "\n")
cat("Median of mpg:", median(mtcars$mpg), "\n")
cat("Minimum value of mpg:", min(mtcars$mpg), "\n")
cat("Maximum value of mpg:", max(mtcars$mpg), "\n")
cat("Quantiles of mpg:", quantile(mtcars$mpg), "\n")
cat("Standard deviation of mpg:", sd(mtcars$mpg), "\n")
cat("Variance of mpg:", var(mtcars$mpg), "\n")
Output:
Summary statistics for mpg variable:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10.40   15.43   19.20   20.09   22.80   33.90
Mean of mpg: 20.09062
Median of mpg: 19.2
Minimum value of mpg: 10.4
Maximum value of mpg: 33.9
Quantiles of mpg: 10.4 15.425 19.2 22.8 33.9
Standard deviation of mpg: 6.026948
Variance of mpg: 36.3241
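The same base R functions can be applied to several columns at once. The following is a minimal sketch, not part of the original walkthrough: the helper name describe_vec and the choice of columns are hypothetical, and na.rm = TRUE is added so the helper would also cope with missing values (mtcars itself has none).

```r
# Columns to summarise (chosen here only for illustration)
vars <- c("mpg", "hp", "wt")

# Hypothetical helper returning several statistics for one numeric vector
describe_vec <- function(x) {
  c(mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    min    = min(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE))
}

# sapply() returns a matrix: one row per statistic, one column per variable
stats_tbl <- sapply(mtcars[vars], describe_vec)
print(round(stats_tbl, 2))
```

Reading down the mpg column of the printed matrix should give the same mean, median, and standard deviation as the individual calls above.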
Step 4: Summary Statistics of Grouped Data by One Variable
R
# Group by one variable (cylinders) and compute summary statistics
mtcars %>%
  group_by(cyl) %>%
  summarise(
    mean_mpg = mean(mpg),
    median_mpg = median(mpg),
    sd_mpg = sd(mpg)
  )
Output:
# A tibble: 3 × 4
    cyl mean_mpg median_mpg sd_mpg
  <dbl>    <dbl>      <dbl>  <dbl>
1     4     26.7       26     4.51
2     6     19.7       19.7   1.45
3     8     15.1       15.2   2.56
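Grouping by more than one variable works the same way: list the extra variables inside group_by(). The sketch below groups by cylinder count (cyl) and transmission type (am); the choice of am as the second variable and the name by_cyl_am are assumptions made for illustration, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# One row per (cyl, am) combination present in the data
by_cyl_am <- mtcars %>%
  group_by(cyl, am) %>%
  summarise(
    n = n(),                # number of cars in the group
    mean_mpg = mean(mpg),
    sd_mpg = sd(mpg),
    .groups = "drop"        # return an ungrouped tibble
  )
print(by_cyl_am)
```

Including n() alongside the statistics helps when some groups are small, since a mean or standard deviation based on two or three rows deserves less weight.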
Summarising Multiple Variables
R
# Summarise multiple variables at once (no grouping)
mtcars %>%
  summarise(
    mean_mpg = mean(mpg),
    mean_disp = mean(disp),
    sd_hp = sd(hp),
    var_wt = var(wt)
  )
Output:
  mean_mpg mean_disp    sd_hp   var_wt
1 20.09062  230.7219 68.56287 0.957379
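When the same statistics are needed for many columns, typing each one out becomes repetitive. One possible alternative, assuming dplyr 1.0.0 or later, is across(), which applies a named list of functions to a set of columns. The column selection and the result name multi_stats here are illustrative.

```r
library(dplyr)

# Apply mean() and sd() to each of the four selected columns
multi_stats <- mtcars %>%
  summarise(across(
    c(mpg, disp, hp, wt),
    list(mean = mean, sd = sd),
    .names = "{.col}_{.fn}"   # produces names such as mpg_mean, mpg_sd
  ))
print(multi_stats)
```

The result is a single row with one column per variable and statistic pair, so it can be combined with group_by() in the same way as the earlier grouped example.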
Step 5: Additional Summary Functions
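Beyond the basic measures, the shape and spread of a distribution can be described with skewness, kurtosis, and the interquartile range (IQR). The skewness() and kurtosis() functions used below come from the e1071 package loaded in Step 1, while IQR() is part of base R.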
R
# Additional statistical summary functions
print("Computing skewness for the mpg variable...")
skewness(mtcars$mpg)
print("Computing kurtosis for the mpg variable...")
kurtosis(mtcars$mpg)
print("Computing interquartile range (IQR) for the mpg variable...")
IQR(mtcars$mpg)
Output:
[1] "Computing skewness for the mpg variable..."
[1] 0.610655
[1] "Computing kurtosis for the mpg variable..."
[1] -0.372766
[1] "Computing interquartile range (IQR) for the mpg variable..."
[1] 7.375
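To see what these numbers represent, the same quantities can be computed by hand in base R. This is a sketch under the assumption that skewness() and kurtosis() were called with e1071's default estimator (type = 3), which divides the central moments by powers of the sample standard deviation; the variable names are illustrative.

```r
x <- mtcars$mpg
m <- mean(x)
s <- sd(x)   # sample standard deviation (n - 1 denominator)

# Third and fourth central moments (averaged over n observations)
m3 <- mean((x - m)^3)
m4 <- mean((x - m)^4)

skew_manual <- m3 / s^3       # positive: longer right tail
kurt_manual <- m4 / s^4 - 3   # excess kurtosis: 0 for a normal distribution

# Interquartile range as the gap between the 75th and 25th percentiles
iqr_manual <- unname(diff(quantile(x, c(0.25, 0.75))))

print(c(skewness = skew_manual, kurtosis = kurt_manual, IQR = iqr_manual))
```

Under that assumption the three values should agree with the output above. Other packages, or other type settings, use slightly different small-sample corrections and can return somewhat different skewness and kurtosis values for the same data.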
Computing summary statistics is an essential first step in understanding a dataset. Whether the data is ungrouped or grouped, base R and packages such as dplyr, data.table, and e1071 provide the tools to compute these statistics efficiently. With these techniques, analysts can quickly check the centre, spread, and shape of their data before moving on to further analysis.