Compute Empirical Cumulative Distribution Function in R

Last Updated : 16 Apr, 2025

The Empirical Cumulative Distribution Function (ECDF) is a non-parametric method for estimating the Cumulative Distribution Function (CDF) of a random variable. Unlike parametric methods, the ECDF makes no assumptions about the underlying probability distribution of the data.

It is defined as a step function that increases by \frac{1}{n} at each observed data point, where n is the total number of observations in the dataset. If a value occurs k times in the data, the jump at that value is \frac{k}{n}.

The ECDF is a useful tool for visualizing the distribution of a dataset and can provide insights into the underlying distribution that would be difficult to obtain through traditional summary statistics.

Prerequisites

Before getting into the details of the Empirical Cumulative Distribution Function (ECDF), it’s important to understand a few foundational concepts related to probability distributions:

1. Probability Density Function (PDF)

The probability density function (PDF) describes the probability distribution of a continuous random variable. It is defined as the derivative of the cumulative distribution function (CDF):

f(x) = \frac{d}{dx}F(x)

Here, F(x) is the cumulative distribution function and f(x) represents the rate at which the probability accumulates with respect to x.
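
As a quick numerical check of this relationship, the sketch below approximates the derivative of the standard normal CDF with a central finite difference and compares it with the density returned by dnorm(). The grid and the step size h are arbitrary choices made for illustration.

```r
# Approximate f(x) = dF/dx for the standard normal with a central difference
x <- seq(-3, 3, by = 0.5)
h <- 1e-5                                   # illustrative step size
approx_pdf <- (pnorm(x + h) - pnorm(x - h)) / (2 * h)

# Largest gap between the finite-difference estimate and the exact density
max_gap <- max(abs(approx_pdf - dnorm(x)))
print(max_gap)
```

The gap should be tiny, which is consistent with the PDF being the derivative of the CDF.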

2. Probability Mass Function (PMF)

The probability mass function (PMF) is used for discrete random variables. It gives the probability that the variable takes a specific value:

P(X = k)

where X is a discrete random variable and k is a particular value in its range.
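
For a concrete case, the sketch below evaluates the PMF of a binomial random variable with dbinom(). The choice of 10 trials with success probability 0.5 is only an illustrative assumption.

```r
# PMF of X ~ Binomial(size = 10, prob = 0.5) over its whole range
k <- 0:10
pmf <- dbinom(k, size = 10, prob = 0.5)

# P(X = 3), which equals choose(10, 3) / 2^10
p3 <- dbinom(3, size = 10, prob = 0.5)
print(p3)

# The probabilities over the full range add up to 1
print(sum(pmf))
```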

3. Cumulative Distribution Function (CDF)

The cumulative distribution function (CDF) gives the probability that a random variable X is less than or equal to a particular value x. It is defined as:

F(x) = P(X \leq x)

The CDF can be defined for both discrete and continuous variables and it is a non-decreasing function ranging from 0 to 1.
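
The sketch below illustrates these properties for a discrete variable. It builds the CDF by accumulating the PMF and compares it with pbinom(); the binomial parameters are again an illustrative assumption.

```r
# CDF of X ~ Binomial(10, 0.5) built by accumulating the PMF
k <- 0:10
pmf <- dbinom(k, size = 10, prob = 0.5)
cdf_from_pmf <- cumsum(pmf)

# Built-in CDF for comparison
cdf_builtin <- pbinom(k, size = 10, prob = 0.5)

print(all.equal(cdf_from_pmf, cdf_builtin))  # the two agree up to rounding
print(all(diff(cdf_builtin) >= 0))           # non-decreasing
print(range(cdf_builtin))                    # stays within [0, 1]
```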

Mathematical Concept of ECDF

The ECDF is defined as follows:

Let x_1, x_2, ..., x_n be a random sample of size n from a distribution with CDF F(x). The ECDF is given by:

\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{I}(x_i \leq x)

where,

\mathbf{I}(x_i \leq x) = \begin{cases}1 & \text{if } x_i \leq x \\0 & \text{otherwise}\end{cases}
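
The definition translates almost directly into R, because the mean of a logical vector is the proportion of TRUE values. The helper name ecdf_at and the small sample below are hypothetical and used only for illustration.

```r
# Hypothetical helper: ECDF at a single point t as the mean of indicators
ecdf_at <- function(sample, t) {
    mean(sample <= t)   # (1/n) * sum of I(x_i <= t)
}

x <- c(2.1, 3.5, 3.5, 4.0, 7.2)
print(ecdf_at(x, 3.5))   # 3 of the 5 observations are <= 3.5
print(ecdf_at(x, 1.0))   # below the minimum, so no observation counts
print(ecdf_at(x, 10))    # above the maximum, so every observation counts
```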

Mean and Variance

The ECDF can be used to estimate the mean and variance of a distribution.
The mean can be estimated by integrating x with respect to the ECDF. Because the ECDF places mass \frac{1}{n} on each observation, this integral reduces to the sample mean:

\hat{\mu}_n = \int_{-\infty}^{\infty} x \, d\hat{F}_n(x)

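The variance of the distribution can be estimated in the same way, which gives the average squared deviation from \hat{\mu}_n (divisor n rather than n - 1):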

\hat{\sigma}_n^2 = \int_{-\infty}^\infty (x - \hat{\mu}_n)^2 \, d\hat{F}_n(x)
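
Since the ECDF puts weight 1/n on each observation, both integrals become weighted sums. The sketch below computes them for a small illustrative sample and compares them with mean() and var(). Note that var() divides by n - 1, so it is rescaled before the comparison.

```r
x <- c(1, 2, 3, 4, 4, 5, 6, 7, 8, 9)
n <- length(x)
w <- rep(1 / n, n)                      # mass the ECDF places on each point

mu_hat <- sum(w * x)                    # integral of x dF_n(x)
sigma2_hat <- sum(w * (x - mu_hat)^2)   # integral of (x - mu_hat)^2 dF_n(x)

print(all.equal(mu_hat, mean(x)))
print(all.equal(sigma2_hat, var(x) * (n - 1) / n))
```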

The ECDF can also be used to estimate confidence intervals for the CDF, which can be useful for hypothesis testing and parameter estimation.
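
The text does not say how such intervals are built. One common choice, assumed here, is a simultaneous band based on the Dvoretzky-Kiefer-Wolfowitz inequality, which uses the half-width sqrt(log(2 / alpha) / (2 n)). The sample, the level alpha, and the evaluation grid below are illustrative.

```r
set.seed(1)
x <- rnorm(200)                          # illustrative sample
n <- length(x)
alpha <- 0.05                            # band has level 1 - alpha

Fn <- ecdf(x)
eps <- sqrt(log(2 / alpha) / (2 * n))    # DKW half-width (assumed method)

# Evaluate the band on a grid and clip it to [0, 1]
grid <- seq(-3, 3, length.out = 61)
lower <- pmax(Fn(grid) - eps, 0)
upper <- pmin(Fn(grid) + eps, 1)

print(eps)
```

Plotting lower and upper alongside Fn(grid) shows the region that is expected to contain the true CDF at every x simultaneously.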

Properties of ECDF

The ECDF has several useful properties:

It is a non-parametric estimate of the CDF, meaning it can be applied to a wide variety of distributions without making assumptions about their shape or parameters.
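It is consistent, meaning that as the sample size increases, the ECDF converges to the true CDF. By the Glivenko-Cantelli theorem, this convergence is uniform in x.
It is unbiased pointwise, meaning that for each fixed x, the expected value of the ECDF equals the true CDF F(x).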
It is a step function, which makes it useful for visualizing the distribution of a dataset.
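
Consistency can be seen in a small simulation. The sketch below measures the largest vertical gap between the ECDF and the standard normal CDF for a small and a large sample. The helper name sup_dist, the seed, and the sample sizes are hypothetical choices for illustration.

```r
# Hypothetical helper: exact sup-distance between the ECDF of x and the N(0, 1) CDF.
# The supremum is attained at a data point, just before or just after the jump.
sup_dist <- function(x) {
    n <- length(x)
    Fx <- pnorm(sort(x))
    i <- seq_len(n)
    max(pmax(i / n - Fx, Fx - (i - 1) / n))
}

set.seed(42)
d_small <- sup_dist(rnorm(25))       # small sample
d_large <- sup_dist(rnorm(100000))   # large sample

print(c(n_25 = d_small, n_100000 = d_large))
```

The gap for the large sample should be much smaller than for the small one, in line with the consistency property.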

Now, let's move on to some examples of how to compute and plot the ECDF. Before starting this tutorial, you need to have a basic understanding of R language and its data structures. You should also have the latest version of R installed on your computer.

ECDF Computation and Plotting in R

To compute and plot the Empirical Cumulative Distribution Function (ECDF) in R, we generate sample data, compute the ECDF using the ecdf() function, and plot the result.

R

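# ecdf() is part of the stats package, which ships with R and is attached
# by default, so no install.packages() or library() call is needed.
data      <- rnorm(1000, mean = 50, sd = 10)  # 1000 draws from N(50, 10^2)
ecdf_func <- ecdf(data)                        # returns a step function
plot(ecdf_func, 
     xlab = "Data", 
     ylab = "Cumulative Probability", 
     main = "Empirical Cumulative Distribution Function")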

Output:

Figure: Empirical Cumulative Distribution Function plot

Example 1: Computing and Plotting the ECDF for a Simple Dataset

Suppose we have a set of 10 data points: 1, 2, 3, 4, 4, 5, 6, 7, 8 and 9. We want to compute the ECDF of this data set.

Manually, we would first sort the data in ascending order: 1, 2, 3, 4, 4, 5, 6, 7, 8, 9. Then, for each value of x, we would count the number of observations that are less than or equal to x and divide by the total number of observations.

To compute the ECDF at x=5, we would count the number of observations that are less than or equal to 5, which is 6. Dividing by the total number of observations, we get \hat{F}(5) = \frac{6}{10} = 0.6. We would repeat this process for all values of x.
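
The manual procedure can be written out as a small table, with one row per distinct value. This is a sketch of the counting step only; the column names are illustrative.

```r
data <- c(1, 2, 3, 4, 4, 5, 6, 7, 8, 9)
n <- length(data)

# For each distinct value, count the observations that are <= that value
values <- sort(unique(data))
counts <- sapply(values, function(v) sum(data <= v))

ecdf_table <- data.frame(x = values, count = counts, F_hat = counts / n)
print(ecdf_table)
```

The row for x = 5 has a count of 6 and an ECDF value of 0.6, matching the calculation above. The row for x = 4 jumps by 0.2 because the value 4 appears twice.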

1. Sort the data

The first step is to sort the data in ascending order and calculate the number of data points:

R

data <- c(1, 2, 5, 4, 4, 3, 6, 7, 8, 9)
sorted <- sort(data, decreasing = FALSE)  # ascending order
n <- length(sorted)                       # number of observations
paste('Length :', n)
print(sorted)

Output:

'Length : 10'
[1] 1 2 3 4 4 5 6 7 8 9

2. Compute the ECDF using a Loop

To compute the ECDF, we loop over each data point and calculate the proportion of observations that are less than or equal to that point. The values are returned in the original order of data, not in sorted order:

R

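ecdf_func <- function(data) {
    Length <- length(data)
    sorted <- sort(data)
    
    ecdf <- rep(0, Length)
    # Loop over the function's own data rather than a global n
    for (i in seq_len(Length)) {
        # Proportion of observations less than or equal to data[i]
        ecdf[i] <- sum(sorted <= data[i]) / Length
    }
    return(ecdf)
}

ecdf <- ecdf_func(data)
print(ecdf)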

Output:

[1] 0.1 0.2 0.6 0.5 0.5 0.3 0.7 0.8 0.9 1.0

3. Compute ECDF using ecdf() function

In R, we can compute the ECDF using the built-in ecdf() function:

R

ecdf_fun <- ecdf(data)
ecdf_ <- ecdf_fun(data)
print(ecdf_)

Output:

[1] 0.1 0.2 0.6 0.5 0.5 0.3 0.7 0.8 0.9 1.0

4. Check whether the two sets of ECDF values are identical

R

identical(ecdf, ecdf_)

Output:

TRUE

The two methods produce the same result, as can be seen by comparing the outputs of ecdf and ecdf_. Each entry is the cumulative proportion of observations less than or equal to the corresponding data point: 0.1 for the smallest value, 0.2 for the second smallest value and so on. The value 4 appears twice, so both occurrences receive 0.5. The largest value in the data has a cumulative probability of 1.0.
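
The loop can also be replaced by a vectorized expression. The sketch below is an alternative that is not part of the original walkthrough: rank() with ties.method = "max" returns, for each element, the number of observations less than or equal to it.

```r
data <- c(1, 2, 5, 4, 4, 3, 6, 7, 8, 9)

# Number of observations <= each element, divided by the sample size
ecdf_rank <- rank(data, ties.method = "max") / length(data)

# Compare against the built-in ecdf()
ecdf_builtin <- ecdf(data)(data)
print(ecdf_rank)
print(all.equal(ecdf_rank, ecdf_builtin))
```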

5. Plot the ECDF

We can also plot the ECDF using the plot() function:

R

plot(data, ecdf_func(data), main="Custom Empirical Cumulative Distribution Plot", xlab="Data Points", ylab="ECDF Value")

Output:

Figure: Custom Empirical Cumulative Distribution Function plot

Example 2: ECDF of Normally distributed data

Suppose we have a dataset of 100 observations that follows a normal distribution with a mean 0 and a standard deviation of 1. We want to compute the ECDF of this dataset and plot it.

1. Generate the data

We generate a dataset of 100 observations that follows a normal distribution with mean 0 and standard deviation 1. In R, we can use the rnorm() function to generate random normal data:

R

set.seed(123)
data <- rnorm(100, mean = 0, sd = 1)

Here, we set the random seed to ensure reproducibility and generate 100 random numbers from a normal distribution with a mean of 0 and a standard deviation of 1. The resulting data object is a vector of length 100.

2. Compute the ECDF values

For each value of x, we want to compute the estimated probability that a data point in the dataset is less than or equal to x. We can compute the ECDF values manually using a for loop or using ecdf() function.

R

ecdf_func <- function(data) {
    Length <- length(data)
    sorted <- sort(data)
    
    ECDF <- rep(0, Length)
    for (i in 1:Length) {
        ECDF[i] <- sum(sorted <= data[i]) / Length
    }
    return(ECDF)
}

ecdf1 <- ecdf_func(data)
plot(data, ecdf1, xlab="Data", ylab="Cumulative Probability", 
     main="Empirical Cumulative Distribution Function")

Output:

Figure: Empirical Cumulative Distribution Function

3. Check that the result is identical to the output of the ecdf() function

R

ecdf_fun <- ecdf(data)
ecdf <- ecdf_fun(data)
identical(ecdf1, ecdf)

Output:

TRUE

4. Compute the cumulative normal distribution with new data

First, we define a sequence of x values from -4 to 4. For each value of x, we compute the true probability that a standard normal random variable is less than or equal to x using the pnorm() function. We use the same mean (0) and standard deviation (1) that were used to generate the data.

R

x <- seq(-4, 4, length.out = 100)

prob <- pnorm(x, mean = 0, sd = 1)
# Plot against x itself so the horizontal axis shows values, not indices
plot(x, prob, type = "l", xlab = "x", ylab = "Probability", 
     main = "Cumulative Normal Distribution")

Output:

Figure: Cumulative Normal Distribution

5. Compute the ECDF at x using ecdf_fun() and plot the CDF and ECDF together

We can plot the true CDF values and ECDF values on the same plot to visualize how closely they match. Here, we evaluate ecdf_fun() from step 3 on the grid x, use the plot() function to draw the true CDF values in blue for x-values from -4 to 4, and add the ECDF values in red with lines(). We also add a legend to the plot to distinguish between the two curves.

R

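# Evaluate the ECDF on the grid; a separate name avoids masking ecdf()
ecdf_vals <- ecdf_fun(x)
plot(x, prob, type = "l", col = "blue", lwd = 2, 
     xlab = "x", ylab = "Cumulative probability", 
     main = "True CDF vs ECDF")
# type = "s" draws the ECDF as steps, matching its definition
lines(x, ecdf_vals, type = "s", col = "red", lwd = 2)
legend("bottomright", legend = c("True CDF", "ECDF"),
       col = c("blue", "red"), lwd = 2)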

Output:

True CDF vs ECDF
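
The visual comparison can be complemented with a number. The sketch below, which is an addition rather than part of the original example, measures the largest gap between the ECDF and the true CDF on the grid and compares it with the Kolmogorov-Smirnov statistic from ks.test(). It regenerates the data with the same seed so that it runs on its own.

```r
set.seed(123)
data <- rnorm(100, mean = 0, sd = 1)
Fn <- ecdf(data)

# Largest vertical gap between ECDF and true CDF on the plotting grid
x <- seq(-4, 4, length.out = 100)
grid_gap <- max(abs(Fn(x) - pnorm(x)))

# Kolmogorov-Smirnov statistic: the largest gap over all x, not just the grid
ks_stat <- unname(ks.test(data, "pnorm")$statistic)

print(c(grid_gap = grid_gap, ks_stat = ks_stat))
```

The grid-based gap can never exceed the Kolmogorov-Smirnov statistic, because the grid covers only some of the points at which the two functions can differ.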

To summarize, we first generate the normal data using the rnorm() function and compute the ECDF both manually, with a for loop and the sum() function, and with the built-in ecdf() function. We then define a sequence of values for x and use the pnorm() function with the known mean and standard deviation to compute the true CDF values for each value of x. Finally, we plot the true CDF and the ECDF on the same axes to see how closely they agree.


irshadahmad
