Introduction
One of Polars’ most powerful features is its native support for reading data directly from cloud object storage. No need to download files first — Polars can read and scan Parquet files on S3 with full support for lazy evaluation, predicate pushdown, and streaming. Today we’ll explore how to connect Polars to S3 and work with real-world datasets at scale.
Quick Reference
| Method | Function | When to Use |
|---|---|---|
| `pl.read_parquet("s3://...")` | Eager read | Small files, quick exploration |
| `pl.scan_parquet("s3://...")` | Lazy scan | Large files, complex queries |
| `storage_options={"anon": True}` | Anonymous access | Public datasets, no credentials needed |
| `storage_options={"profile": "..."}` | Named profile | Multiple AWS accounts |
| `collect(streaming=True)` | Streaming mode | Datasets larger than memory |
Prerequisites
Polars uses the object_store Rust crate under the hood, so S3 support works out of the box. Make sure you have Polars installed:
```bash
pip install polars
```

For authentication, credentials can be provided via:

- AWS CLI config (`aws configure`)
- Environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
- The `~/.aws/credentials` file
- IAM roles (when running on EC2/ECS/Lambda)
- The `storage_options` parameter in Polars
Reading a Single File (Eager Mode)
The simplest approach — load an entire file into memory:
```python
import polars as pl

# Read a single Parquet file from S3
df = pl.read_parquet(
    "s3://daylight-openstreetmap/parquet/osm_features/release=v1.58/type=way/"
    "20241112_191814_00139_grr7u_fea4d477-4748-4e7d-9aed-90290d792f01"
)

print(f"Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"Memory: {df.estimated_size() / 1024 / 1024:.2f} MB")
print(df.head())
```

Key points:

- Works exactly like `pl.read_parquet()` with a local file
- The entire file is downloaded and loaded into memory
- Best for files that fit comfortably in RAM
Anonymous Access to Public Buckets
Many public datasets on S3 don’t require AWS credentials. Use storage_options to access them anonymously:
```python
# Access public datasets without credentials
df = pl.read_parquet(
    "s3://daylight-openstreetmap/parquet/osm_features/release=v1.58/type=way/"
    "20241112_191814_00139_grr7u_fea4d477-4748-4e7d-9aed-90290d792f01",
    storage_options={"aws_region": "us-east-1"},
)

print(f"✓ Loaded {df.shape[0]:,} rows without any credentials!")
```

When to use `storage_options`:
| Option | Purpose |
|---|---|
| `{"anon": True}` | Explicitly anonymous access |
| `{"aws_region": "us-east-1"}` | Specify the region for the bucket |
| `{"profile": "work"}` | Use a named AWS CLI profile |
| `{"aws_access_key_id": "...", "aws_secret_access_key": "..."}` | Inline credentials (avoid in production!) |
Lazy Scanning with scan_parquet (Recommended)
For large datasets, scan_parquet is the way to go. It creates a query plan without loading data — the actual read happens only when you call .collect():
```python
# Scan all Parquet files in a directory (glob pattern)
lf = pl.scan_parquet(
    "s3://daylight-openstreetmap/parquet/osm_features/**/*.parquet"
)

# Check the schema without reading any data
print(f"Schema: {lf.collect_schema()}")
```

Now build queries on top of the lazy frame — Polars will optimize the execution plan, only reading the columns and rows it needs:
```python
# Count records by type — only reads the 'type' column
type_counts = (
    lf
    .group_by("type")
    .agg(pl.len().alias("count"))
    .sort("count", descending=True)
    .collect()
)
print(type_counts)
```

Why Lazy is Better for S3
```
Eager: Download entire file → Load all columns → Filter → Aggregate
Lazy:  Build plan → Push predicates down → Read only needed columns → Aggregate
```
With lazy evaluation, Polars can:
- Projection pushdown: Only download columns your query uses
- Predicate pushdown: Skip row groups that don’t match your filters
- Parallel reads: Download multiple files concurrently
Filtering S3 Data Efficiently
Combine lazy scanning with filters to minimize data transfer:
```python
# Only reads the 'id', 'type', 'tags', and 'geometry' columns;
# the filter is pushed into the scan, so non-matching rows are
# dropped as the data is read
highways = (
    lf
    .filter(pl.col("tags").str.contains("highway"))
    .select(["id", "type", "tags", "geometry"])
    .limit(100)
    .collect()
)

print(f"Found {highways.shape[0]} highway features")
print(highways.head())
```

Aggregating Across Multiple Files
One of the best use cases for Polars + S3 is aggregating across a partitioned dataset:
```python
# Statistics grouped by release version — scans across all files
release_stats = (
    lf
    .group_by("release")
    .agg([
        pl.len().alias("total_features"),
        pl.col("id").n_unique().alias("unique_ids"),
        pl.col("version").mean().alias("avg_version"),
    ])
    .sort("release")
    .collect()
)
print(release_stats)
```

Streaming Mode for Very Large Datasets
When even lazy evaluation isn’t enough, because intermediate aggregation results are too large for memory, enable streaming:
```python
# Streaming processes data in batches, keeping memory usage bounded
streaming_result = (
    lf
    .filter(pl.col("type") == "way")
    .group_by("release")
    .agg([
        pl.len().alias("count"),
        pl.col("version").max().alias("max_version"),
    ])
    .collect(streaming=True)  # ← enable the streaming engine
)
print(streaming_result)
```

When to use streaming:
- Dataset doesn’t fit in memory
- Aggregation across many files
- You only need summary statistics, not the raw data
Joining S3 Datasets
You can join multiple S3 datasets together using lazy frames:
```python
# Scan two separate datasets
features_lf = pl.scan_parquet(
    "s3://daylight-openstreetmap/parquet/osm_features/**/*.parquet"
)
elements_lf = pl.scan_parquet(
    "s3://daylight-openstreetmap/parquet/osm_elements/**/*.parquet"
)

# Join them on 'id' — Polars handles the optimization
joined = (
    features_lf.limit(10_000)
    .join(
        elements_lf.limit(10_000),
        on="id",
        how="inner",
    )
    .select(["id", "type", "tags"])
    .collect()
)

print(f"Joined result: {joined.shape}")
print(joined.head())
```

Writing Results Back to S3
After analysis, you can write results back to S3:
```python
# Write results to a local Parquet file
result_df.write_parquet("analysis_results.parquet")

# Write to S3 (requires write permissions)
# Option 1: Use the AWS CLI
#   aws s3 cp analysis_results.parquet s3://my-bucket/results/

# Option 2: Use boto3 in Python
import boto3

s3_client = boto3.client("s3")
s3_client.upload_file(
    "analysis_results.parquet",
    "my-bucket",
    "results/analysis_results.parquet",
)
```

Performance Tips
- Always prefer `scan_parquet` over `read_parquet` for S3 — it enables predicate and projection pushdown
- Use glob patterns (`**/*.parquet`) to scan partitioned datasets
- Enable streaming for aggregations across datasets larger than memory
- Specify `storage_options` with the correct region to avoid cross-region data transfer
- Download locally first for iterative exploration on the same file:

```bash
# Download once, read many times
aws s3 cp s3://daylight-openstreetmap/parquet/osm_features/release=v1.58/type=way/FILENAME \
    ./osm_sample.parquet --no-sign-request
```

```python
# Then load locally (much faster for repeated reads)
df = pl.read_parquet("./osm_sample.parquet")
```

Practice Exercise
You want to analyze public NYC taxi trip data stored on S3.
```python
import polars as pl

# NYC Taxi dataset (public, partitioned by month)
# s3://nyc-tlc/trip data/yellow_tripdata_2023-*.parquet

# For practice, let's assume you have a local sample
taxi = pl.DataFrame({
    "pickup_datetime": [
        "2026-01-10 08:15:00", "2026-01-10 09:30:00",
        "2026-01-10 12:00:00", "2026-01-10 18:45:00",
        "2026-01-11 07:00:00", "2026-01-11 14:30:00",
    ],
    "dropoff_datetime": [
        "2026-01-10 08:45:00", "2026-01-10 10:00:00",
        "2026-01-10 12:30:00", "2026-01-10 19:15:00",
        "2026-01-11 07:25:00", "2026-01-11 15:00:00",
    ],
    "trip_distance": [3.2, 8.5, 2.1, 12.0, 5.5, 4.8],
    "fare_amount": [15.0, 32.5, 10.0, 45.0, 22.0, 18.5],
    "tip_amount": [3.0, 6.5, 2.0, 9.0, 4.4, 3.7],
    "payment_type": ["Credit", "Credit", "Cash", "Credit", "Credit", "Cash"],
    "passenger_count": [1, 2, 1, 3, 1, 2],
}).with_columns(
    pl.col("pickup_datetime").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S"),
    pl.col("dropoff_datetime").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S"),
)
```

Tasks:
1. Simulate lazy S3 read: Convert the DataFrame to a LazyFrame (simulating `scan_parquet`) and calculate the average fare per payment type.
2. Trip duration: Calculate the trip duration in minutes for each trip using the pickup and dropoff times.
3. Cost per mile: Add a column `cost_per_mile` (fare / distance) and find the most expensive trip per mile.
4. Peak hours: Extract the pickup hour and find which hour has the highest average fare.
5. Daily summary: Group by date and calculate: total trips, total revenue (fare + tip), and average trip distance.
Click to see solutions
```python
# Task 1: Lazy evaluation simulation
avg_fare_by_payment = (
    taxi.lazy()
    .group_by("payment_type")
    .agg(pl.col("fare_amount").mean().alias("avg_fare"))
    .collect()
)
print("Avg fare by payment type:")
print(avg_fare_by_payment)

# Task 2: Trip duration in minutes
with_duration = taxi.with_columns(
    (pl.col("dropoff_datetime") - pl.col("pickup_datetime"))
    .dt.total_minutes()
    .alias("duration_minutes")
)
print("\nTrip durations:")
print(with_duration.select(["pickup_datetime", "dropoff_datetime", "duration_minutes"]))

# Task 3: Cost per mile
with_cost = taxi.with_columns(
    (pl.col("fare_amount") / pl.col("trip_distance")).alias("cost_per_mile")
)
most_expensive = with_cost.sort("cost_per_mile", descending=True).head(1)
print("\nMost expensive trip per mile:")
print(most_expensive)

# Task 4: Peak hours
peak_hours = (
    taxi.with_columns(
        pl.col("pickup_datetime").dt.hour().alias("pickup_hour")
    )
    .group_by("pickup_hour")
    .agg(pl.col("fare_amount").mean().alias("avg_fare"))
    .sort("avg_fare", descending=True)
)
print("\nPeak hours by avg fare:")
print(peak_hours)

# Task 5: Daily summary
daily_summary = (
    taxi.with_columns(
        pl.col("pickup_datetime").dt.date().alias("date"),
        (pl.col("fare_amount") + pl.col("tip_amount")).alias("total_revenue"),
    )
    .group_by("date")
    .agg([
        pl.len().alias("total_trips"),
        pl.col("total_revenue").sum().alias("total_revenue"),
        pl.col("trip_distance").mean().alias("avg_distance"),
    ])
    .sort("date")
)
print("\nDaily summary:")
print(daily_summary)
```