Introduction
Joins are fundamental to data engineering - they allow us to combine data from multiple sources. Polars offers a rich set of join operations that are optimized for performance. Today we’ll explore all join types from basic to advanced, with practical examples you can use immediately.
Quick Reference
| Join Type | What It Returns | Common Use Case |
|---|---|---|
inner |
Only matching rows from both tables | Find common records |
left |
All rows from left + matching from right | Enrich data with reference info |
right |
All rows from right + matching from left | When right table is primary |
full |
All rows from both tables | Combine two datasets completely |
cross |
Cartesian product (every combination) | Generate all combinations |
semi |
Left rows where key exists in right | Filter based on existence |
anti |
Left rows where key NOT in right | Find missing records |
asof |
Match on nearest key (time-series) | Join time-series data |
Setup: Sample Data
import polars as pl
# Left table: Orders
orders = pl.DataFrame({
"order_id": [1, 2, 3, 4, 5],
"customer_id": [101, 102, 103, 104, 105],
"product": ["A", "B", "A", "C", "B"],
"amount": [100, 200, 150, 300, 250],
})
# Right table: Customers
customers = pl.DataFrame({
"customer_id": [101, 102, 103, 106],
"name": ["Alice", "Bob", "Carol", "David"],
"city": ["NYC", "LA", "Chicago", "Seattle"],
})
# Products reference table
products = pl.DataFrame({
"product": ["A", "B", "C", "D"],
"category": ["Electronics", "Books", "Clothing", "Food"],
"price": [99, 25, 50, 10],
})Inner Join
Returns only rows where keys match in both tables.
# Find orders with valid customers
inner_result = orders.join(
customers,
on="customer_id",
how="inner"
)
print(inner_result)Result:
order_id | customer_id | product | amount | name | city
---------|-------------|---------|--------|-------|--------
1 | 101 | A | 100 | Alice | NYC
2 | 102 | B | 200 | Bob | LA
3 | 103 | A | 150 | Carol | Chicago
Key points: - Order 4 and 5 (customers 104, 105) are excluded - not in customers table - Customer David (106) is excluded - has no orders - Only 3 rows returned
Left Join
Returns all rows from left + matching from right (nulls where no match).
# All orders, enrich with customer info where available
left_result = orders.join(
customers,
on="customer_id",
how="left"
)
print(left_result)Result:
order_id | customer_id | product | amount | name | city
---------|-------------|---------|--------|--------|--------
1 | 101 | A | 100 | Alice | NYC
2 | 102 | B | 200 | Bob | LA
3 | 103 | A | 150 | Carol | Chicago
4 | 104 | C | 300 | null | null
5 | 105 | B | 250 | null | null
Key points: - All 5 orders preserved - Orders 4 and 5 have null customer info (customers 104, 105 not in reference table) - Most common join type for data enrichment
Right Join
Returns all rows from right + matching from left (nulls where no match).
# All customers, show their orders (if any)
right_result = orders.join(
customers,
on="customer_id",
how="right"
)
print(right_result)Result:
order_id | customer_id | product | amount | name | city
---------|-------------|---------|--------|-------|--------
1 | 101 | A | 100 | Alice | NYC
2 | 102 | B | 200 | Bob | LA
3 | 103 | A | 150 | Carol | Chicago
null | 106 | null | null | David | Seattle
Key points: - All 4 customers preserved - David (106) has null order info - no orders - Similar to left join but keeps right side
Full/Outer Join
Returns all rows from both tables (nulls where no match).
# Complete view: all orders and all customers
full_result = orders.join(
customers,
on="customer_id",
how="full"
)
print(full_result)Result:
order_id | customer_id | product | amount | name | city
---------|-------------|---------|--------|--------|--------
1 | 101 | A | 100 | Alice | NYC
2 | 102 | B | 200 | Bob | LA
3 | 103 | A | 150 | Carol | Chicago
4 | 104 | C | 300 | null | null
5 | 105 | B | 250 | null | null
null | 106 | null | null | David | Seattle
Key points: - 6 rows total (union of both sets) - Preserves all information from both tables - Use when you need complete data from both sources
Cross Join
Returns Cartesian product - every row from left combined with every row from right.
# Generate all combinations
cross_result = orders.join(
products.select("category").unique(),
how="cross"
)
print(f"Orders: {len(orders)}, Categories: {4}, Result: {len(cross_result)}")
print(cross_result.head(8))Key points: - No on parameter needed - Result has len(left) × len(right) rows - Use sparingly - can create huge datasets - Common use: generating test data or all possible combinations
Semi Join
Returns left rows where key exists in right (no right columns added).
# Find orders from known customers only (filter, don't enrich)
semi_result = orders.join(
customers,
on="customer_id",
how="semi"
)
print(semi_result)Result:
order_id | customer_id | product | amount
---------|-------------|---------|--------
1 | 101 | A | 100
2 | 102 | B | 200
3 | 103 | A | 150
Key points: - Like inner but doesn’t add right columns - Used for filtering only - More efficient than inner join when you don’t need right columns - Orders 4 and 5 excluded (unknown customers)
Anti Join
Returns left rows where key does NOT exist in right.
# Find orders from unknown/missing customers
anti_result = orders.join(
customers,
on="customer_id",
how="anti"
)
print(anti_result)Result:
order_id | customer_id | product | amount
---------|-------------|---------|--------
4 | 104 | C | 300
5 | 105 | B | 250
Key points: - Opposite of semi join - Great for data quality checks - Find orphaned records, missing references
AsOf Join (Time-Series)
Matches on nearest key without exceeding - perfect for time-series.
# Stock prices at different times
prices = pl.DataFrame({
"time": [
"2026-01-01 09:00", "2026-01-01 09:05",
"2026-01-01 09:10", "2026-01-01 09:15"
],
"price": [100.0, 101.5, 99.8, 102.0]
}).with_columns(pl.col("time").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M"))
# Trades happening at various times
trades = pl.DataFrame({
"time": [
"2026-01-01 09:02", "2026-01-01 09:07",
"2026-01-01 09:12"
],
"volume": [100, 200, 150]
}).with_columns(pl.col("time").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M"))
# Match each trade with the most recent price
asof_result = trades.join_asof(
prices,
on="time",
strategy="backward" # Use most recent price before trade
)
print(asof_result)Result:
time | volume | price
--------------------|--------|-------
2026-01-01 09:02:00 | 100 | 100.0 <- price at 09:00
2026-01-01 09:07:00 | 200 | 101.5 <- price at 09:05
2026-01-01 09:12:00 | 150 | 99.8 <- price at 09:10
Key points: - No exact time match needed - strategy="backward" - nearest earlier time - strategy="forward" - nearest later time - Essential for financial data, sensor readings
Multi-Key Joins
Join on multiple columns for more precise matching:
# Sales data with region
sales = pl.DataFrame({
"date": ["2026-01-01", "2026-01-01", "2026-01-02"],
"region": ["North", "South", "North"],
"amount": [100, 200, 150]
})
# Targets by date and region
targets = pl.DataFrame({
"date": ["2026-01-01", "2026-01-01", "2026-01-02"],
"region": ["North", "South", "North"],
"target": [120, 180, 140]
})
# Join on both date and region
multi_join = sales.join(
targets,
on=["date", "region"],
how="left"
).with_columns(
(pl.col("amount") - pl.col("target")).alias("variance")
)
print(multi_join)Join with Suffix
When both tables have columns with same names (other than join keys):
# Both tables have 'amount' column
orders_with_price = pl.DataFrame({
"product": ["A", "B", "C"],
"amount": [100, 200, 300], # order amount
})
products_with_cost = pl.DataFrame({
"product": ["A", "B", "C"],
"amount": [80, 150, 250], # product cost
})
# Add suffix to distinguish columns
result = orders_with_price.join(
products_with_cost,
on="product",
how="left",
suffix="_cost" # right table columns get this suffix
)
print(result)
# Columns: product, amount, amount_costPractice Exercise
You have three datasets for an e-commerce platform:
import polars as pl
from datetime import datetime
# User registrations
users = pl.DataFrame({
"user_id": [1, 2, 3, 4, 5],
"email": ["alice@example.com", "bob@example.com",
"carol@example.com", "david@example.com", "eve@example.com"],
"signup_date": [datetime(2026, 1, 1), datetime(2026, 1, 5),
datetime(2026, 1, 10), datetime(2026, 1, 15),
datetime(2026, 1, 20)],
})
# Orders placed
orders = pl.DataFrame({
"order_id": [101, 102, 103, 104, 105, 106],
"user_id": [1, 2, 1, 3, 99, 2], # Note: user 99 doesn't exist
"order_date": [datetime(2026, 1, 5), datetime(2026, 1, 8),
datetime(2026, 1, 12), datetime(2026, 1, 15),
datetime(2026, 1, 18), datetime(2026, 1, 22)],
"total": [50.0, 75.5, 120.0, 30.0, 200.0, 45.0],
})
# Order items (some orders have multiple items)
items = pl.DataFrame({
"order_id": [101, 101, 102, 103, 103, 103, 104, 105, 106],
"product": ["A", "B", "C", "A", "D", "E", "B", "F", "A"],
"quantity": [2, 1, 3, 1, 2, 1, 1, 5, 2],
"price": [20.0, 10.0, 25.0, 20.0, 30.0, 50.0, 10.0, 40.0, 20.0],
})
# Support tickets
tickets = pl.DataFrame({
"ticket_id": [1, 2, 3],
"user_id": [1, 1, 99], # User 99 doesn't exist
"issue": ["Refund", "Question", "Complaint"],
})Tasks:
Enrich Orders: Create a report showing all orders with user email (left join). How many orders have no matching user?
Find Orphans: Identify orders placed by non-existent users using anti join.
Active Users: Find users who have placed at least one order (semi join). What percentage of registered users are active?
User Order Summary: For each user, calculate: total orders, total spent, average order value. Include users with zero orders.
Order Items Aggregation: Join orders with items and calculate: total items per order, verify order total matches sum(items).
Support Analysis: Find users who have opened support tickets but never placed an order.
Time-Series Match: Using the signup_date and order_date, find each user’s first order time using asof join.
Click to see solutions
# Task 1: Enrich orders with user emails
enriched_orders = orders.join(
users.select("user_id", "email"),
on="user_id",
how="left"
)
print(f"Orders without matching user: {enriched_orders.filter(pl.col('email').is_null()).shape[0]}")
# Task 2: Find orphan orders (anti join)
orphan_orders = orders.join(
users.select("user_id"),
on="user_id",
how="anti"
)
print("Orphan orders:")
print(orphan_orders)
# Task 3: Active users (semi join)
active_users = users.join(
orders.select("user_id").unique(),
on="user_id",
how="semi"
)
active_pct = len(active_users) / len(users) * 100
print(f"Active users: {len(active_users)}/{len(users)} ({active_pct:.1f}%)")
# Task 4: User order summary (groupby with left join for all users)
user_summary = orders.group_by("user_id").agg([
pl.count().alias("total_orders"),
pl.col("total").sum().alias("total_spent"),
pl.col("total").mean().alias("avg_order_value"),
]).join(
users.select("user_id", "email"),
on="user_id",
how="right" # Keep all users even without orders
).fill_null(0).select([
"user_id", "email", "total_orders", "total_spent", "avg_order_value"
])
print(user_summary)
# Task 5: Verify order totals
order_totals = items.group_by("order_id").agg(
pl.col("quantity").sum().alias("total_items"),
(pl.col("quantity") * pl.col("price")).sum().alias("calculated_total")
).join(
orders.select("order_id", "total"),
on="order_id"
).with_columns(
(pl.col("calculated_total") == pl.col("total")).alias("total_matches")
)
print(order_totals)
# Task 6: Support tickets from users without orders
ticket_only_users = tickets.join(
orders.select("user_id").unique(),
on="user_id",
how="anti"
).join(users.select("user_id", "email"), on="user_id", how="left")
print("Users with tickets but no orders:")
print(ticket_only_users)
# Task 7: First order time per user (asof join)
user_first_order = users.join_asof(
orders.select("user_id", "order_date").sort("order_date"),
left_on="signup_date",
right_on="order_date",
by="user_id",
strategy="forward"
).select([
"user_id", "email", "signup_date", "order_date"
])
print("User signup and first order:")
print(user_first_order)