How To Calculate Summary Statistics In Pandas
Last Updated: 16 Apr, 2025
Pandas, a versatile data manipulation library for Python, provides several ways to calculate summary statistics on a dataset. Summary statistics give a quick overview of a dataset's most important features, such as its center, spread and shape. In this article, we will explore five methods of computing summary statistics using Pandas.
Using describe() for Descriptive Statistics
The describe() method generates descriptive statistics for a DataFrame in a single call. For numeric columns it reports the count, mean, standard deviation, min, 25th percentile, median (50th percentile), 75th percentile and max. By default, non-numeric columns are left out of the summary when the DataFrame also contains numeric columns.
Python
import pandas as pd
d = {'A': [1, 2, 3, 4, 5],
     'B': [10, 20, 15, 25, 30]}
df = pd.DataFrame(d)
# Using describe() to calculate summary statistics
res = df.describe()
print(res)
Output
              A          B
count  5.000000   5.000000
mean   3.000000  20.000000
std    1.581139   7.905694
min    1.000000  10.000000
25%    2.000000  15.000000
50%    3.000000  20.000000
75%    4.000000  25.000000
max    5.000000  30.000000
Explanation: We create a dictionary of numerical values and convert it to a Pandas DataFrame. Calling describe() on the DataFrame then returns a summary of key statistics for each column: count, mean, standard deviation, minimum, maximum and the 25th, 50th and 75th percentiles.
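The sketch below shows two optional describe() arguments, percentiles and include. The data, including the text column named label, is made up for illustration and is not part of the example above.

```python
import pandas as pd

# Hypothetical data: one numeric column and one text column
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "label": ["x", "y", "x", "z", "x"],
})

# Default behaviour: only the numeric column "A" is summarized
numeric_summary = df.describe()

# Request the 10th, 50th and 90th percentiles instead of the defaults
custom_summary = df["A"].describe(percentiles=[0.1, 0.5, 0.9])

# include="all" also reports unique, top and freq for the text column
full_summary = df.describe(include="all")

print(numeric_summary)
print(custom_summary)
print(full_summary)
```

With include="all", statistics that do not apply to a column (for example the mean of a text column) are shown as NaN.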
Calculating mean, median and mode
Pandas also has separate methods to calculate the mean, median and mode of each column in a DataFrame.
Python
import pandas as pd
d = {'A': [1, 2, 3, 4, 5],
     'B': [10, 20, 15, 25, 30]}
df = pd.DataFrame(d)
# Calculating mean, median, and mode
mean_values = df.mean()
median_values = df.median()
mode_values = df.mode().iloc[0]
print(mean_values)
print(median_values)
print(mode_values)
Output
A 3.0
B 20.0
dtype: float64
A 3.0
B 20.0
dtype: float64
A 1
B 10
Name: 0, dtype: int64
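Explanation: mean() calculates the average of each column, median() finds the middle value of the sorted data and mode() identifies the most frequent value or values. Because a column can have several modes, mode() returns a DataFrame with one row per mode, and .iloc[0] selects the first row. In this dataset every value occurs exactly once, so every value is a mode and the first row simply holds the smallest value of each column (1 and 10).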
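To make the behaviour of mode() easier to see, here is a small sketch with repeated values. The data is hypothetical and chosen so that column A has two modes and column B has one.

```python
import pandas as pd

# Hypothetical data: "A" has two modes (2 and 3), "B" has one mode (10)
df = pd.DataFrame({
    "A": [1, 2, 2, 3, 3],
    "B": [10, 10, 20, 30, 40],
})

# One row per mode; columns with fewer modes are padded with NaN
modes = df.mode()
print(modes)

# agg() computes several statistics in a single call
summary = df.agg(["mean", "median"])
print(summary)
```

Taking only .iloc[0] here would hide the second mode of column A, so it is worth inspecting the full result of mode() when ties are possible.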
Correlation using corr()
Correlation measures the strength and direction of the linear relationship between two variables. The corr() method computes the pairwise correlation of the columns in a Pandas DataFrame, which makes it a quick way to spot related variables in a dataset with many columns.
Python
import pandas as pd
d = {'A': [1, 2, 3, 4, 5],
     'B': [10, 20, 15, 25, 30]}
df = pd.DataFrame(d)
# Calculating correlation between columns
res = df.corr()
print(res)
Output
     A    B
A  1.0  0.9
B  0.9  1.0
Explanation: corr() calculates the pairwise correlation coefficient between numerical columns, using the Pearson method by default. The coefficient ranges from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship), with 0 indicating no linear correlation.
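As a rough check on the 0.9 reported above, the Pearson coefficient can be reproduced from the covariance and the two standard deviations. The sketch also adds a hypothetical text column named label to show one way of restricting corr() to numeric columns.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [10, 20, 15, 25, 30],
    "label": ["x", "y", "x", "z", "x"],  # hypothetical non-numeric column
})

# Pearson r = cov(A, B) / (std(A) * std(B))
cov_ab = df["A"].cov(df["B"])
pearson_manual = cov_ab / (df["A"].std() * df["B"].std())

# The built-in Series.corr() gives the same value
pearson = df["A"].corr(df["B"])

# Keep only numeric columns before building the correlation matrix
corr_matrix = df.select_dtypes(include="number").corr()

print(pearson_manual, pearson)
print(corr_matrix)
```

Selecting the numeric columns first avoids errors or surprises when a DataFrame mixes numbers with text.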
Calculating variance and standard deviation
Variance (var()) and Standard Deviation (std()) help measure the dispersion of data points around the mean.
Python
import pandas as pd
d = {'A': [1, 2, 3, 4, 5],
     'B': [10, 20, 15, 25, 30]}
df = pd.DataFrame(d)
# Calculating variance and standard deviation
variance_val = df.var()
std_dev_val = df.std()
print(variance_val)
print(std_dev_val)
Output
A 2.5
B 62.5
dtype: float64
A 1.581139
B 7.905694
dtype: float64
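Explanation: var() computes the variance, which is based on the squared deviations from the mean, and std() returns its square root, the standard deviation, which is expressed in the same units as the data. By default both use the sample formula (ddof=1), dividing by n - 1 rather than n.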
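The following sketch contrasts the default sample statistics with their population counterparts by passing ddof=0. It reuses the same data as above; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [10, 20, 15, 25, 30],
})

# Default (ddof=1): sum of squared deviations divided by n - 1
sample_var = df.var()

# ddof=0: sum of squared deviations divided by n
population_var = df.var(ddof=0)
population_std = df.std(ddof=0)

print(sample_var)
print(population_var)
print(population_std)
```

For column A the squared deviations sum to 10, so the sample variance is 10 / 4 = 2.5 and the population variance is 10 / 5 = 2.0.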
Calculating skewness and kurtosis
Skewness (skew()) measures the asymmetry of the data distribution, while kurtosis (kurt()) measures how heavy its tails are, which makes it sensitive to outliers.
Python
import pandas as pd
d = {'A': [1, 2, 3, 4, 5],
     'B': [10, 20, 15, 25, 30]}
df = pd.DataFrame(d)
# Calculating skewness and kurtosis
skewness_val = df.skew()
kurtosis_val = df.kurt()
print(skewness_val)
print(kurtosis_val)
Output
A 0.0
B 0.0
dtype: float64
A -1.2
B -1.2
dtype: float64
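Explanation: skew() shows how far the data is skewed to the left (negative values) or right (positive values), while kurt() measures the tail thickness of the distribution compared to a normal distribution. Pandas reports excess kurtosis (Fisher's definition), so a normal distribution scores 0. Positive kurtosis indicates heavier tails, while negative kurtosis suggests lighter tails. Both columns here are symmetric around their mean, which is why the skewness is 0.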
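Because the example data is symmetric, its skewness is zero. The sketch below uses a made-up column with one large value to show how an outlier on the right pushes both skewness and kurtosis above zero; the column names are hypothetical.

```python
import pandas as pd

# Hypothetical data: "symmetric" is evenly spaced,
# "right_tail" contains one unusually large value
df = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_tail": [1, 2, 3, 4, 100],
})

skewness = df.skew()
kurtosis = df.kurt()

# "right_tail" has positive skewness and positive excess kurtosis,
# while "symmetric" has zero skewness and negative excess kurtosis
print(skewness)
print(kurtosis)
```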