How To Calculate Summary Statistics In Pandas

Last Updated : 16 Apr, 2025

Pandas, a versatile data manipulation library for Python, provides several ways to calculate summary statistics on datasets. Summary statistics give a quick overview of a dataset's most important features, such as its center, spread and shape. In this article, we will explore five ways of computing summary statistics using Pandas.

Using describe() for Descriptive Statistics

The describe() method generates descriptive statistics for a DataFrame. By default it summarizes the numeric columns, reporting the count, mean, standard deviation, min, 25th percentile, median (50th percentile), 75th percentile and max.

Python
import pandas as pd

d = {'A': [1, 2, 3, 4, 5],
     'B': [10, 20, 15, 25, 30]}
df = pd.DataFrame(d)

# Using describe() to calculate summary statistics
res = df.describe()
print(res)

Output
              A          B
count  5.000000   5.000000
mean   3.000000  20.000000
std    1.581139   7.905694
min    1.000000  10.000000
25%    2.000000  15.000000
50%    3.000000  20.000000
75%    4.000000  25.000000
max    5.000000  30.000000

Explanation: We create a dictionary of numerical values and convert it to a Pandas DataFrame. Calling describe() on the DataFrame returns a new DataFrame with one row per statistic (count, mean, standard deviation, min, percentiles and max) and one column per numeric column.
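describe() also accepts optional parameters. The sketch below is a minimal illustration, assuming a made-up text column 'C' that is not part of the example above: percentiles selects which percentiles to report, and include='all' extends the summary to non-numeric columns.

```python
import pandas as pd

# Same numeric columns as above, plus a hypothetical text column 'C'
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [10, 20, 15, 25, 30],
                   'C': ['x', 'y', 'x', 'z', 'x']})

# By default only the numeric columns (A and B) are summarized
numeric_summary = df.describe()

# Request the 10th, 50th and 90th percentiles instead of 25/50/75
custom_summary = df.describe(percentiles=[0.1, 0.5, 0.9])

# include='all' also summarizes non-numeric columns (count, unique, top, freq)
full_summary = df.describe(include='all')

print(custom_summary)
print(full_summary)
```

For the text column, the statistics that do not apply (such as mean) show up as NaN, while top and freq report the most common value and how often it occurs.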

Calculating mean, median and mode

Pandas also provides separate methods to calculate the mean, median and mode of each column in a DataFrame.

Python
import pandas as pd

d = {'A': [1, 2, 3, 4, 5],
     'B': [10, 20, 15, 25, 30]}
df = pd.DataFrame(d)

# Calculating mean, median, and mode
mean_values = df.mean()
median_values = df.median()
mode_values = df.mode().iloc[0]  # first mode of each column

print(mean_values)
print(median_values)
print(mode_values)

Output
A     3.0
B    20.0
dtype: float64
A     3.0
B    20.0
dtype: float64
A     1
B    10
Name: 0, dtype: int64

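Explanation: mean() calculates the average of each column, median() finds the middle value of the sorted data and mode() identifies the most frequent value(s). Because a column can have several modes, mode() returns a DataFrame, and we extract the first row using .iloc[0]. In this example every value occurs exactly once, so every value is a mode and .iloc[0] simply returns the smallest value in each column.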
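The behaviour of mode() with tied values is easier to see side by side. The following is a small sketch, assuming a hypothetical series s with one repeated value; it also uses agg() as one possible way to compute several statistics in a single call.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [10, 20, 15, 25, 30]})

# agg() computes several statistics at once; each row is one statistic
summary = df.agg(['mean', 'median'])
print(summary)

# Every value in column A occurs once, so mode() returns all five values
print(df['A'].mode().tolist())   # [1, 2, 3, 4, 5]

# A hypothetical series with a repeated value has a single mode
s = pd.Series([1, 2, 2, 3, 4])
print(s.mode().tolist())         # [2]
```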

Correlation using corr()

Correlation measures the strength and direction of the linear relationship between two variables. The corr() method computes the pairwise correlation of the columns in a Pandas DataFrame, which makes it a quick way to check how the columns of a dataset relate to one another.

Python
import pandas as pd

d = {'A': [1, 2, 3, 4, 5],
     'B': [10, 20, 15, 25, 30]}
df = pd.DataFrame(d)

# Calculating correlation between columns
res = df.corr()
print(res)

Output
     A    B
A  1.0  0.9
B  0.9  1.0

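Explanation: corr() calculates the correlation coefficient between every pair of numerical columns, using the Pearson method by default. The coefficient ranges from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship), with 0 indicating no linear relationship. Here A and B have a correlation of 0.9, a strong positive relationship.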
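To see where the 0.9 comes from, the coefficient can be rebuilt by hand. This is an illustrative sketch rather than how Pandas computes it internally: the Pearson coefficient is the covariance divided by the product of the two standard deviations, and a rank-based (Spearman-style) coefficient can be obtained by correlating the ranks of the values.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [10, 20, 15, 25, 30]})

# Pearson correlation = covariance / (std of A * std of B)
manual_r = df['A'].cov(df['B']) / (df['A'].std() * df['B'].std())

# The same value from the built-in Series method
builtin_r = df['A'].corr(df['B'])

# Rank-based correlation: Pearson correlation of the ranked values
rank_r = df.rank().corr().loc['A', 'B']

print(round(manual_r, 6), round(builtin_r, 6), round(rank_r, 6))
```

For this data all three values come out to 0.9, up to floating-point rounding. On data with a monotonic but non-linear relationship, the rank-based value would generally differ from the Pearson value.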

Calculating variance and standard deviation

Variance (var()) and Standard Deviation (std()) help measure the dispersion of data points around the mean.

Python
import pandas as pd

d = {'A': [1, 2, 3, 4, 5],
     'B': [10, 20, 15, 25, 30]}
df = pd.DataFrame(d)

# Calculating variance and standard deviation
variance_val = df.var()
std_dev_val = df.std()

print(variance_val)
print(std_dev_val)

Output
A     2.5
B    62.5
dtype: float64
A    1.581139
B    7.905694
dtype: float64

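Explanation: var() computes the variance, the average squared deviation from the mean, while std() calculates the standard deviation, which is the square root of the variance and is expressed in the same units as the data. By default both use the sample formula (ddof=1), dividing by n - 1 rather than n.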
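The divisor matters on small datasets. As a short sketch, the ddof parameter of var() and std() switches between the sample formula (ddof=1, the default) and the population formula (ddof=0):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [10, 20, 15, 25, 30]})

sample_var = df.var()            # ddof=1: divides by n - 1
population_var = df.var(ddof=0)  # ddof=0: divides by n

# Standard deviation is the square root of the variance
std_from_var = sample_var ** 0.5

print(sample_var)
print(population_var)
print(std_from_var)
```

With five rows, the population variance of A is 2.0 and of B is 50.0, compared with the sample values 2.5 and 62.5 shown above.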

Calculating skewness and kurtosis

Skewness (skew()) measures the asymmetry of the data distribution, while Kurtosis (kurt()) measures how heavy the tails of the distribution are, which reflects how prone the data is to outliers.

Python
import pandas as pd

d = {'A': [1, 2, 3, 4, 5],
     'B': [10, 20, 15, 25, 30]}
df = pd.DataFrame(d)

# Calculating skewness and kurtosis
skewness_val = df.skew()
kurtosis_val = df.kurt()

print(skewness_val)
print(kurtosis_val)

Output
A    0.0
B    0.0
dtype: float64
A   -1.2
B   -1.2
dtype: float64

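Explanation: skew() indicates how much the data is skewed to the left (negative values) or right (positive values), while kurt() measures the tail thickness of the distribution compared to a normal distribution. Pandas reports excess kurtosis, so a normal distribution scores 0: positive kurtosis indicates heavier tails, while negative kurtosis suggests lighter tails. Here both columns are symmetric (skewness 0.0) and have lighter tails than a normal distribution (kurtosis -1.2).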
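Because the example data is perfectly symmetric, it does not show what non-zero skewness looks like. The sketch below uses a hypothetical series with one large outlier (made-up values, not taken from the example above) to contrast the two cases.

```python
import pandas as pd

# Hypothetical data: mostly small values plus one large outlier
skewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 50])
symmetric = pd.Series([1, 2, 3, 4, 5])

print(skewed.skew())     # positive: long tail to the right
print(skewed.kurt())     # positive: heavier tails than a normal distribution
print(symmetric.skew())  # zero for perfectly symmetric data
print(symmetric.kurt())  # negative: lighter tails than a normal distribution
```

A single extreme value is enough to push both statistics well above zero, which is why they are often checked before applying methods that assume roughly normal data.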
