
How to Create a Decile Column in Python Polars

Last Updated : 12 Sep, 2024

In this tutorial, we'll learn how to create a decile column using Python's Polars library. Deciles split sorted data into ten equal-sized groups, each containing about 10% of the values. They are often used in statistics to summarize how data is distributed, for example to compare the lowest 10% of values with the highest 10%.

Installing Polars

Polars is a high-performance DataFrame library that's well-suited for handling large datasets. We can install Polars using pip:

pip install polars

Loading and Understanding the Data

Let's start by loading a dataset. For this example, we'll create a DataFrame with some random numerical values:

Python
import polars as pl
import numpy as np

# Creating a DataFrame with random data
data = pl.DataFrame({
    'id': np.arange(1, 101),
    'value': np.random.randint(100, 1000, 100)
})

print(data)
[Screenshot: Creating a Polars DataFrame]

This will generate a dataset with 100 rows and two columns: id and value. The value column contains random integers from 100 to 999, because the upper bound passed to np.random.randint is exclusive.

Calculating Deciles

The decile for each row will be based on the value column. We'll use the qcut expression method (Expr.qcut) in Polars, which bins values by quantile, to divide the data into deciles.

Here’s how to create a decile column:

Python
import polars as pl
import numpy as np

# Creating a DataFrame with random data
data = pl.DataFrame({
    'id': np.arange(1, 101),
    'value': np.random.randint(100, 1000, 100)
})

print(data)

# Define the number of deciles
decile_bins = 10

# Calculate deciles and create a new 'decile' column
data = data.with_columns(
    # Use pl.col('value') to access the column, then apply qcut
    pl.col('value').qcut(decile_bins).alias('decile')
)

print(data)

Explanation:

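  • pl.col('value').qcut(decile_bins) divides the value column into 10 quantile bins (deciles).
  • The result is a new Categorical column called decile, where each row holds the label of the bin it falls into. By default the labels are interval strings giving each bin's lower and upper bound, not numeric ranks; pass the labels argument to qcut to use custom names.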

Sorting and Grouping by Deciles

We might also want to sort the data or group it by deciles to get an overview:

1. Sorting by Decile:

Python
# ...

sorted_data = data.sort('decile')
print(sorted_data)

Output:

[Screenshot: Sorting Decile Column]

2. Grouping by Decile and Calculating Summary Statistics:

Python
# Group by decile and calculate summary statistics
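# Note: group_by replaces the older groupby spelling,
# and pl.len() replaces the deprecated pl.count()
summary_stats = data.group_by('decile').agg(
    [
        pl.col('value').mean().alias('mean_value'),
        pl.col('value').min().alias('min_value'),
        pl.col('value').max().alias('max_value'),
        pl.len().alias('count')
    ]
)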

print(summary_stats)

Output:

[Screenshot: Grouping by Decile]

This will give us a summary of each decile, showing the mean, minimum, and maximum values for the value column, along with the number of rows in each decile.

Conclusion

Creating decile columns in Python using Polars is straightforward and efficient. With the qcut expression method, we can quickly assign each row to a decile bin and use the result for analysis, sorting, or grouping.

