How to Apply a Custom Function in Polars that Does the Processing Row by Row?

Last Updated : 29 Jul, 2024

Polars is a fast DataFrame library written in Rust with Python bindings, designed to handle large datasets efficiently. It provides an expressive API for data manipulation, similar in spirit to pandas, with performance optimizations such as multithreaded, columnar execution that can significantly speed up data processing. One common task is applying a custom function to a DataFrame row by row. This article shows how to do that in Polars.

Prerequisites

Before we begin, ensure you have Polars installed in your Python environment. You can install it using pip:

pip install polars

You should also have a basic understanding of Python programming and DataFrame operations.


Loading Data into Polars DataFrame

To demonstrate how to apply a custom function row by row in Polars, we'll first create a sample DataFrame. The following code creates a DataFrame with three columns: name, age, and salary.

Python
import polars as pl

# Create a sample DataFrame
data = {
    "name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "age": [25, 30, 35, 40, 45],
    "salary": [50000, 60000, 70000, 80000, 90000]
}

df = pl.DataFrame(data)
print(df)

Output

shape: (5, 3)
┌─────────┬─────┬────────┐
│ name    ┆ age ┆ salary │
│ ---     ┆ --- ┆ ---    │
│ str     ┆ i64 ┆ i64    │
╞═════════╪═════╪════════╡
│ Alice   ┆ 25  ┆ 50000  │
│ Bob     ┆ 30  ┆ 60000  │
│ Charlie ┆ 35  ┆ 70000  │
│ David   ┆ 40  ┆ 80000  │
│ Eva     ┆ 45  ┆ 90000  │
└─────────┴─────┴────────┘

Applying a Custom Function Row by Row

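In Polars, you can apply a custom Python function to each row by packing the columns you need into a struct with pl.struct() and calling map_elements on it. This method was named apply before Polars 0.19, and the old name has since been removed. The function receives each row as a dictionary keyed by column name. Here are three examples to illustrate this.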

Example 1: Applying a Custom Function to Calculate Age Category

We define a custom function categorize_age that places each individual into one of three age groups: "Young" (under 30), "Middle-aged" (30 to 39), and "Senior" (40 and over). We then apply this function to each row of the DataFrame using pl.struct().map_elements(), the method that was called apply before Polars 0.19. Passing return_dtype tells Polars the output type up front instead of leaving it to be inferred. The result is a new column named "age_category" that contains the age category for each individual.

Python
# Custom function to categorize age
def categorize_age(row):
    age = row["age"]
    if age < 30:
        return "Young"
    elif 30 <= age < 40:
        return "Middle-aged"
    else:
        return "Senior"

# Apply the custom function row by row
# (map_elements replaces the older apply method)
df = df.with_columns([
    pl.struct(["age"])
    .map_elements(categorize_age, return_dtype=pl.String)
    .alias("age_category")
])
print(df)

Output

shape: (5, 4)
┌─────────┬─────┬────────┬──────────────┐
│ name    ┆ age ┆ salary ┆ age_category │
│ ---     ┆ --- ┆ ---    ┆ ---          │
│ str     ┆ i64 ┆ i64    ┆ str          │
╞═════════╪═════╪════════╪══════════════╡
│ Alice   ┆ 25  ┆ 50000  ┆ Young        │
│ Bob     ┆ 30  ┆ 60000  ┆ Middle-aged  │
│ Charlie ┆ 35  ┆ 70000  ┆ Middle-aged  │
│ David   ┆ 40  ┆ 80000  ┆ Senior       │
│ Eva     ┆ 45  ┆ 90000  ┆ Senior       │
└─────────┴─────┴────────┴──────────────┘

Example 2: Applying a Custom Function to Adjust Salary Based on Age

This example demonstrates a custom function adjust_salary that adjusts salaries based on age group, applying a multiplier of 1.1 for ages under 30, 1.05 for ages 30 to 39, and 1.03 otherwise. We use pl.struct().map_elements() (called apply before Polars 0.19) to apply this function to each row, passing both columns the function needs. The result is a new column called "adjusted_salary" that holds the adjusted salaries as floats.

Python
# Custom function to adjust salary
def adjust_salary(row):
    age = row["age"]
    salary = row["salary"]
    if age < 30:
        return salary * 1.1
    elif 30 <= age < 40:
        return salary * 1.05
    else:
        return salary * 1.03

# Apply the custom function row by row
# (map_elements replaces the older apply method)
df = df.with_columns([
    pl.struct(["age", "salary"])
    .map_elements(adjust_salary, return_dtype=pl.Float64)
    .alias("adjusted_salary")
])
print(df)

Output

shape: (5, 5)
┌─────────┬─────┬────────┬──────────────┬─────────────────┐
│ name    ┆ age ┆ salary ┆ age_category ┆ adjusted_salary │
│ ---     ┆ --- ┆ ---    ┆ ---          ┆ ---             │
│ str     ┆ i64 ┆ i64    ┆ str          ┆ f64             │
╞═════════╪═════╪════════╪══════════════╪═════════════════╡
│ Alice   ┆ 25  ┆ 50000  ┆ Young        ┆ 55000.0         │
│ Bob     ┆ 30  ┆ 60000  ┆ Middle-aged  ┆ 63000.0         │
│ Charlie ┆ 35  ┆ 70000  ┆ Middle-aged  ┆ 73500.0         │
│ David   ┆ 40  ┆ 80000  ┆ Senior       ┆ 82400.0         │
│ Eva     ┆ 45  ┆ 90000  ┆ Senior       ┆ 92700.0         │
└─────────┴─────┴────────┴──────────────┴─────────────────┘

Example 3: Combining Multiple Columns in a Custom Function

In this example, we create a custom function combine_name_age that combines the name and age columns into a single string. By applying this function using pl.struct().map_elements() (called apply before Polars 0.19), we generate a new column "name_age" that contains the combined string for each row, giving a concise representation of name and age together.

Python
# Custom function to combine name and age
def combine_name_age(row):
    return f"{row['name']} ({row['age']})"

# Apply the custom function row by row
# (map_elements replaces the older apply method)
df = df.with_columns([
    pl.struct(["name", "age"])
    .map_elements(combine_name_age, return_dtype=pl.String)
    .alias("name_age")
])
print(df)


Output

shape: (5, 6)
┌─────────┬─────┬────────┬──────────────┬─────────────────┬──────────────┐
│ name    ┆ age ┆ salary ┆ age_category ┆ adjusted_salary ┆ name_age     │
│ ---     ┆ --- ┆ ---    ┆ ---          ┆ ---             ┆ ---          │
│ str     ┆ i64 ┆ i64    ┆ str          ┆ f64             ┆ str          │
╞═════════╪═════╪════════╪══════════════╪═════════════════╪══════════════╡
│ Alice   ┆ 25  ┆ 50000  ┆ Young        ┆ 55000.0         ┆ Alice (25)   │
│ Bob     ┆ 30  ┆ 60000  ┆ Middle-aged  ┆ 63000.0         ┆ Bob (30)     │
│ Charlie ┆ 35  ┆ 70000  ┆ Middle-aged  ┆ 73500.0         ┆ Charlie (35) │
│ David   ┆ 40  ┆ 80000  ┆ Senior       ┆ 82400.0         ┆ David (40)   │
│ Eva     ┆ 45  ┆ 90000  ┆ Senior       ┆ 92700.0         ┆ Eva (45)     │
└─────────┴─────┴────────┴──────────────┴─────────────────┴──────────────┘

Conclusion

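Applying custom functions row by row in Polars is straightforward: pack the relevant columns into a struct and call map_elements on it (the method formerly named apply). Keep in mind that each row is processed by the Python interpreter, so this approach is considerably slower than Polars' native expressions and is best reserved for logic that cannot be expressed with them. Used that way, it lets you implement custom processing directly within your DataFrame operations while the rest of the pipeline keeps Polars' performance.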

