100 Days of Polars - Day 002: Window Functions

Introduction: What Are Window Functions?

Window functions are one of the most powerful features in Polars for performing calculations across rows in a group. Unlike aggregations group_by that collapse rows into single values, window functions maintain all rows while computing values based on a “window” of data.

Window function allows you to:

Calculate running totals and moving averages
Rank rows within groups
Compare values with group averages
Perform time-based calculations

Window function can be used through .over() in polars. You can achive the result as window function with group_by and join in polars.

import polars as pl
from polars import window_function as wf

df = pl.DataFrame({
    "department": ["Sales", "Sales", "Sales", "Engineering", "Engineering", "Engineering"],
    "employee": ["Alice", "Bob", "Carol", "David", "Eve", "Frank"],
    "salary": [50000, 60000, 55000, 75000, 80000, 72000]
})

# Add a column showing each employee's salary compared to their department's average.
df.with_columns(
    dept_avg_salary=pl.col("salary").mean().over("department")
)

# Equivalent with group_by and join
dept_avg = df.group_by("department").agg(
    pl.col("salary").mean().alias("dept_avg_salary")
)
df.join(dept_avg, on="department")

Common Window Functions

Aggregations

All standard aggregation functions work as window functions, aggregation functions must come before window functions .over:

df.with_columns(
    dept_salary_sum=pl.col("salary").sum().over("department"),
    dept_salary_min=pl.col("salary").min().over("department"),
    dept_salary_max=pl.col("salary").max().over("department"),
    dept_salary_count=pl.col("salary").count().over("department")
)

Cumulative Sums

df.sort("salary").with_columns(
    running_total=pl.col("salary").cum_sum().over("department")
)

Ranking

df.with_columns(
    salary_rank=pl.col("salary").rank().over("department")
)

Available ranking methods: - "ordinal" - sequential integers (1, 2, 3…) - "dense" - dense ranking without gaps - "min", "max", "avg" - various average ranking methods

First and Last Values

df.sort("salary").with_columns(
    lowest_in_dept=pl.col("salary").first().over("department"),
    highest_in_dept=pl.col("salary").last().over("department")
)

Practice Exercise

Scenario: You have a sales dataset with regional sales data:

import polars as pl
from datetime import datetime

sales_df = pl.DataFrame({
    "region": ["North", "North", "North", "South", "South", "South", "East", "East"],
    "salesperson": ["Alice", "Bob", "Carol", "David", "Eve", "Frank", "Grace", "Henry"],
    "quarter": ["Q1", "Q1", "Q2", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "sales": [10000, 15000, 12000, 20000, 18000, 22000, 14000, 16000]
})

Tasks:

Calculate each salesperson’s sales as a percentage of their region’s total
Rank salespeople within each region by sales amount
Calculate running total of sales within each region (sorted by sales amount)
Find the top performer in each region and quarter

Click to see solutions

# Task 1: Percentage of regional total
sales_df.with_columns(
    region_total=pl.col("sales").sum().over("region"),
    pct_of_region=(pl.col("sales") / pl.col("region_total")) * 100
).drop("region_total")

# Task 2: Rank within region
sales_df.with_columns(
    rank_in_region=pl.col("sales").rank("ordinal", descending=True).over("region")
)

# Task 3: Running total within region
sales_df.sort("sales", descending=True).with_columns(
    running_total=pl.col("sales").cum_sum().over("region")
)

# Task 4: Top performer in each region and quarter
sales_df.with_columns(
    max_sales=pl.col("sales").max().over("region", "quarter")
).filter(pl.col("sales") == pl.col("max_sales")).drop("max_sales")