fork() for AI agents.

When your agent forks N ways to explore a problem, thaw skips the cold prefill and runs the N branches in parallel from one shared KV cache. The substrate for RL rollouts, multi-agent reasoning, and parallel coding agents.

Validated on vLLM · SGLang · LangGraph · H100 · A40 · A6000 · 8/8 bit-identical
Primitive

One function. N divergent futures.

Snapshot the engine, restore N children that share the parent's KV cache at the fork point, diverge. Same process or across workers.

fork_example.py
THAW · v0.4
from thaw_vllm import LLM, fork

# A live vLLM engine, already serving traffic.
parent = LLM("meta-llama/Llama-3.1-8B")
parent.generate(prefix_messages)  # warms the KV cache + prefix-hash table

# Snapshot mid-stream → fork into N divergent children.
children = fork(parent, n=4)  # 0.88s median on H100 · steady state

# Each child resumes from the parent's exact context.
# No re-prefill. Same trunk, different futures.
for child, suffix in zip(children, suffixes):
    print(child.generate(suffix))
Without thaw
N rollouts × prefill_time
Re-prefill the same 8K-token prompt eight times before any divergence happens.
With thaw
N rollouts × memcpy_time
One snapshot, eight forks. Bounded by PCIe bandwidth, not by the model.
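To put rough numbers on the memcpy side of that comparison, here is a back-of-envelope sketch. It assumes Llama-3.1-8B's published shape (32 layers, 8 KV heads, head dimension 128), an fp16 KV cache, an 8K prompt of exactly 8,192 tokens, and the 55 GB/s restore rate quoted under Receipts. The helper names `kv_bytes_per_token` and `copy_seconds` are illustrative and not part of thaw's API.

```python
# Back-of-envelope: how large is the KV cache for an 8K-token prompt,
# and how long does one copy of it take at a given bandwidth?
# Assumed shape: Llama-3.1-8B (32 layers, 8 KV heads, head_dim 128), fp16.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    # The factor of 2 covers the separate K and V tensors in every layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def copy_seconds(num_bytes: int, bandwidth_bytes_per_s: float) -> float:
    return num_bytes / bandwidth_bytes_per_s

per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2)
prompt_tokens = 8192
kv_total = per_token * prompt_tokens

print(per_token)                      # 131072 bytes (128 KiB) per token
print(kv_total / 2**30)               # 1.0 (GiB for the whole prompt)
print(copy_seconds(kv_total, 55e9))   # roughly 0.02 s per copy at 55 GB/s
```

This covers only the raw copy of the cache. The fork-round latency reported under Receipts is a larger number because each round there also generates 64 tokens per branch.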
Agent loops

Built for the loops agents actually run.

TREE-SEARCH · MCTS · BEST-OF-N

Agent branching

Fork a reasoning trunk into N parallel hypotheses mid-conversation. Each child inherits the parent's KV cache at the fork point, runs concurrently on the same GPU, then you pick the winner. The expensive trunk only gets paid for once.

prefill per trunk
from thaw_vllm import LLM, fork

# trunk_messages, hyps, and pick_best are supplied by the caller.
llm = LLM("meta-llama/Llama-3.1-8B")
llm.generate(trunk_messages)  # warm KV
forks = fork(llm, n=8)  # 8 branches
best = pick_best(f.generate(h) for f, h in zip(forks, hyps))
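`pick_best` is not defined in the snippet above. One possible selector is sketched here, assuming each branch returns plain text and that you bring your own scoring function. The `score` parameter and the length-based default are hypothetical stand-ins for a reward model or verifier.

```python
from typing import Callable, Iterable

def pick_best(candidates: Iterable[str], score: Callable[[str], float] = len) -> str:
    """Return the highest-scoring candidate; ties go to the earliest one."""
    # max() keeps the first maximal element and raises ValueError on empty input.
    return max(candidates, key=score)

# Stand-in outputs; in the snippet above these would come from f.generate(...).
outputs = ["short", "a longer hypothesis", "mid-length"]
print(pick_best(outputs))  # a longer hypothesis
```

Because it accepts any iterable, the selector works with the generator expression used in the branching example.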
How it works

Under the hood: snapshot, fan-out, diverge.

One parent engine, N children, zero re-prefill. The work is in four specific steps: scattered KV slabs coalesce, ping-pong through two pinned buffers, land on children via PCIe Gen5, then get tagged with prefix hashes so the next request is a cache hit. Real mechanics, real numbers.

01

Coalesce

10K → 1

KV slabs gather into one contiguous tensor

Per-slab DMA tops out at ~50 MB/s. One coalesced tensor hits 3.4 GB/s. See kv_snapshot.py:_coalesce_kv_to_gpu_buffer.
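The coalescing idea can be illustrated with a toy model. This is a sketch only: plain `bytes` objects stand in for GPU-resident KV slabs, and the `coalesce` and `slab_view` helpers are hypothetical names, not thaw's implementation.

```python
# Toy model of coalescing: many small, scattered slabs are packed into one
# contiguous buffer so a single large transfer can replace thousands of tiny ones.

def coalesce(slabs: list[bytes]) -> tuple[bytearray, list[tuple[int, int]]]:
    total = sum(len(s) for s in slabs)
    buffer = bytearray(total)   # one contiguous allocation
    index = []                  # (offset, length) for each slab
    offset = 0
    for slab in slabs:
        buffer[offset:offset + len(slab)] = slab
        index.append((offset, len(slab)))
        offset += len(slab)
    return buffer, index

def slab_view(buffer: bytearray, index: list[tuple[int, int]], i: int) -> bytes:
    # Recover slab i from the packed buffer using its recorded offset.
    offset, length = index[i]
    return bytes(buffer[offset:offset + length])

slabs = [bytes([i % 256]) * 16 for i in range(10_000)]   # 10K slabs of 16 bytes
buffer, index = coalesce(slabs)
print(len(buffer))                                  # 160000
print(slab_view(buffer, index, 42) == slabs[42])    # True
```

The offset index is what lets the receiving side scatter the packed buffer back into individual blocks.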

02

Pipeline

55 GB/s

Two pinned buffers ping-pong · 86% of PCIe Gen5 ceiling
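The ping-pong pattern can be sketched in plain Python. This is a structural illustration under stated assumptions: threads and `bytearray` staging buffers stand in for pinned host memory and asynchronous DMA, and `pipelined_copy` is a hypothetical name rather than a thaw function.

```python
import queue
import threading

def pipelined_copy(source: bytes, chunk_size: int) -> bytes:
    """Copy `source` through two staging buffers that alternate roles."""
    buffers = [bytearray(chunk_size), bytearray(chunk_size)]
    free = queue.Queue()    # indices of buffers the producer may fill
    ready = queue.Queue()   # (index, valid_bytes) the consumer may drain
    for i in range(2):
        free.put(i)

    def producer() -> None:
        for start in range(0, len(source), chunk_size):
            chunk = source[start:start + chunk_size]
            i = free.get()                    # blocks until a buffer is drained
            buffers[i][:len(chunk)] = chunk   # stage the chunk
            ready.put((i, len(chunk)))
        ready.put(None)                       # end-of-stream marker

    dest = bytearray()
    worker = threading.Thread(target=producer)
    worker.start()
    while (item := ready.get()) is not None:
        i, n = item
        dest += buffers[i][:n]   # stands in for the host-to-device transfer
        free.put(i)              # hand the buffer back for refilling
    worker.join()
    return bytes(dest)

data = bytes(range(256)) * 100
print(pipelined_copy(data, chunk_size=1000) == data)  # True
```

While one buffer is being drained, the producer is free to fill the other, which is what keeps the link busy. Taking 64 GB/s as the nominal PCIe Gen5 x16 ceiling (an assumption about which figure the page uses), 55 GB/s is about 86% of it.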

03

Restore

CRC32C ✓

Chunks land on child GPUs · parallel verification
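As a rough sketch of per-chunk verification, the block below computes CRC32C in pure Python and checks chunks concurrently. It shows the structure only: thaw's runtime is Rust + CUDA, pure-Python threads will not reproduce the quoted throughput, and `verify_chunks` is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor

# CRC32C (Castagnoli) lookup table, reflected polynomial 0x82F63B78.
_TABLE = []
for byte in range(256):
    crc = byte
    for _ in range(8):
        crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
    _TABLE.append(crc)

def crc32c(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF

def verify_chunks(chunks: list[bytes], expected: list[int], workers: int = 4) -> list[bool]:
    """Checksum every chunk concurrently and compare with the recorded CRCs."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        actual = list(pool.map(crc32c, chunks))
    return [a == e for a, e in zip(actual, expected)]

assert crc32c(b"123456789") == 0xE3069283   # standard CRC32C check value

chunks = [bytes([i]) * 1024 for i in range(8)]
recorded = [crc32c(c) for c in chunks]       # checksums written at snapshot time
chunks[3] = b"\xff" + chunks[3][1:]          # simulate corruption in one chunk
print(verify_chunks(chunks, recorded))
# [True, True, True, False, True, True, True, True]
```

Because each chunk carries its own checksum, a failure pinpoints which chunk to re-transfer instead of invalidating the whole snapshot.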

04

Diverge

0.88s

Prefix hashes re-inserted · children skip prefill
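Why re-inserting prefix hashes lets children skip prefill can be seen in a small model of a block-level prefix cache. This sketch follows the general chained-block-hash idea used by prefix caches; the block size, the SHA-256 choice, and the `block_hashes` helper are assumptions for illustration and do not describe thaw's or vLLM's actual table layout.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block; an assumed value for this sketch

def block_hashes(token_ids: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each full block together with its parent's hash, so a block's
    hash identifies the entire prefix that ends at that block."""
    hashes, parent = [], ""
    full = len(token_ids) - len(token_ids) % block_size   # ignore the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes

# Parent side: record which physical block holds each prefix hash.
trunk = list(range(48))   # 3 full blocks
table = {h: block_id for block_id, h in enumerate(block_hashes(trunk))}

# Child side: once the same table is re-inserted after restore, a request
# that starts with the trunk finds its leading blocks already cached.
request = trunk + [900, 901, 902]
hits = [h in table for h in block_hashes(request)]
print(hits)   # [True, True, True]
```

A request that differs in its first token misses on every block, since each hash depends on everything before it.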

Receipts

Receipts, not benchmarks.

raw JSON · github
# · Category · Test rig

01 · Fork latency · steady state
H100 80GB PCIe · Llama-3.1-8B · 5 rounds × 4 branches × 64 tokens

0.88s · Median fork round
1.16s · First round (post-warmup)
400× · Warmup amortization vs cold boot
4 · Concurrent branches per fork

02 · DMA restore · PCIe Gen5
H100 SXM · pinned host memory · double-buffered pipelined DMA

55 GB/s · DMA restore · line-rate saturation
0.29s · Hot-swap between 8B engines
14 GB/s · Parallel CRC32C peak
2.89× · vs serial CRC verification

03 · 70B sleep snapshot · TP=2
2× H100 SXM · Llama-3.1-70B TP=2 · bit-identical across 8 architectures

141 GB · Resident 70B TP=2 snapshot
16.1s · Sleep · 9.04 GB/s aggregate
53.6s · Wake from snapshot
8/8 · Architectures bit-identical
All measurements re-runnable from the GitHub repo · raw JSON below.
v0.4 · production preview
Vision

The primitive is open. The integrations are the company.

Pinecone became a company by being the easiest vector DB the moment RAG mattered. LangChain became a company by being the orchestration layer everyone calls. thaw becomes a company by being the fork primitive every agent framework — TRL, slime, verl, LangGraph, Temporal — calls when an agent loop needs to branch.

TODAY

Open-source primitive.

The fork() runtime is Apache-2.0, written in Rust + CUDA, integrated with the engines real production traffic runs on. Anyone can pip install thaw-vllm and snapshot a live agent session.

v0.4 · pre-tagged · production preview
PARTNERS

Wired with the teams building agents.

Working with Courier (SLC) on MLX-side inference for the Apple-silicon agentic stack: a paid integration sprint validates the snapshot/restore semantics against a non-vLLM engine in production. RFC #34303 is upstream with vLLM.

courier · vllm · sglang · langgraph
TOMORROW

The fork primitive every agent framework calls.

The OSS primitive is the wedge. The company is the framework-layer integrations — TRL, slime, verl, LangGraph, Temporal — that make fork() a first-class verb in every agent loop. Agent branching, RL rollouts, parallel coding agents, multi-agent reasoning. Not a platform. A primitive.

primitive → framework integrations → ecosystem
Integrations

Plugs into the engines you already run.

01 · PRIMARY ENGINE

vLLM

First-class integration via vLLM's load_format extension. Snapshot a live vLLM engine, restore into a fresh process, prefix-cache hash table rebuilt on the way back. RFC #34303 in flight upstream for sleep-mode integration.

vllm_example.py · v0.4
from vllm import LLM
from thaw_vllm import fork

parent = LLM(model="meta-llama/Llama-3.1-8B",
             load_format="thaw")

children = fork(parent, n=4)  # PCIe Gen5 line-rate
02 · CLASS-PASSTHROUGH

SGLang

Class-passthrough loader keeps SGLang's scheduler intact and snapshots the runtime state directly. Same fork() ergonomics as vLLM. Validated on H100 SXM with Llama, Qwen, and DeepSeek architectures.

sglang_example.py · v0.4
from sglang import Engine
from thaw_native import ThawEngine
 
parent = ThawEngine(Engine, "…/Qwen2.5-7B")
parent.generate(prefix)
 
children = parent.fork(n=4)
03 · DROP-IN CHATMODEL

LangGraph

Drop-in LangChain BaseChatModel — every existing LangGraph node works unchanged. fork_fanout(llm, prefix, [suffixes]) exposes explicit fan-out for tool-use branching and parallel reviewers. PR-review demo: 4 reviewers, 1.43s median round.

langgraph_example.py · v0.4
from thaw_vllm.langgraph import ChatThaw, fork_fanout

llm = ChatThaw(model="…/Llama-3.1-8B")

reviews = fork_fanout(
    llm, diff_messages,
    suffixes=["security", "perf", "tests", "style"]
)
Install

Pre-built wheels. No Rust toolchain required.

Two PyPI packages: thaw-vllm for first-class vLLM integration, thaw-native for the underlying Rust runtime. CUDA wheels published for Python 3.10–3.12. Apache-2.0.

vLLM integration
$ pip install thaw-vllm
Native runtime
$ pip install thaw-native
Apache 2.0 · Rust + CUDA · production preview
vLLM RFC #34303 upstream