← All posts

Why a million Rust coroutines weigh less than 100k Python ones

async fn becomes a state machine everywhere — but Rust puts it on the stack while CPython puts it on the heap. Memory benchmarks and practice.

Contents

In brief

The same async service on Rust holds hundreds of thousands of connections with predictable RSS; on Python you shard workers because memory grows faster than request count. A Habr deep-dive compares where suspended task state lives — stack vs heap — across Tokio, asyncio, .NET, and V8.

What happened

Take a typical handler: accept, read, query the DB, respond. In every language async fn lowers to a state machine: each await is a state variant; locals that survive suspension must be stored somewhere — either on the caller’s stack or in a heap allocation.

Rust’s cargo expand shows enums like HandleState with Start, AwaitingDb, AwaitingFetch. Size is static; the future sits on the stack unless you Box::pin. An empty timer coroutine is tens of bytes; a handler with buffers is kilobytes.

CPython wraps each async def in a coroutine object plus a frame on the heap under the GC. Benchmarking 100,000 empty coroutines on asyncio.sleep lands around 45–50 MB before any real payload.

C# keeps IAsyncStateMachine on the stack until the first real suspension, then boxes to the heap — one allocation per task. JavaScript (V8) stopped spawning extra promises per await after 2018, but there is still no compact stack state machine: in-flight work is promises and reactions on the heap.

Runtime 100k empty coroutines Per task
Rust / Tokio hundreds of KB – few MB hundreds of bytes
Python / asyncio ~45–50 MB ~0.5–1 KB

Why it matters

The myth “a sleeping coroutine weighs nothing” breaks at scale. A paused task costs its fattest state frame plus runtime wrapper overhead you cannot drop. Ten thousand idle connections is measurable megabytes, not noise.

For architecture, high-concurrency I/O is not only “Rust is faster on CPU.” It is linear RSS growth with concurrent awaits. Interpreter and managed runtimes hit memory and allocators orders of magnitude sooner.

In Rust, “drop heavy values before await” shrinks the enum. In Python you pay for both frame data and the coroutine object.

In practice

  1. Measure suspension cost, not call cost: bytes alive across each await.
  2. In Rust, inspect async block sizes before release; treat Box::pin as explicit.
  3. In Python, cap concurrency with a semaphore or pool — avoid millions of idle coroutines.
  4. In .NET, use ValueTask on often-synchronous paths to reduce state-machine boxing.
  5. In Node.js, under load you often hit GC and in-flight promises before CPU — use heap snapshots.
  6. When sizing connection limits, budget memory per task for your runtime, not only FDs.

Takeaway

Async is sold as “write sync, get concurrency free.” Only the syntax is free — pauses bill bytes per sleeping task. Rust charges at compile time; Python and managed runtimes hide the bill in the heap until RSS graphs spike. When an async service “eats memory for no reason,” look at what survives each await first.