Delta Weight Sync#
Why#
Slime’s default sync broadcasts every parameter every step. The cost scales linearly with model size and dominates the sync phase, even though only a few percent of weights change between consecutive RL steps. Delta sync keeps a pinned-CPU snapshot of the last broadcast and ships only the positions whose bytes differ.
The motivating use case is training/inference disaggregation — running the trainer and the rollout engines in different datacenters over a shared filesystem with bandwidth on the order of 100s of MB/s, where a full broadcast is infeasible but a sparse delta (~3% density, ~5 GB for a 355B model) is. The same delta machinery also runs over NCCL inside a single datacenter, where it serves as the validation baseline that proves the wire encoding and apply logic are correct.
Prior art: selective overwrite is inspired by arXiv:2509.19128; the cross-DC disaggregation motivation is from Fireworks AI — Frontier RL Is Cheaper Than You Think.
Quick Start#
Disk transport (training/inference disaggregation — the main use case):
--update-weight-mode delta
--update-weight-transport disk
--update-weight-encoding deltas_zstd # best for ≤ 300 MB/s shared FS
--update-weight-delta-dir /shared/fs/delta-updates
NCCL transport (intra-datacenter validation baseline):
--update-weight-mode delta
--update-weight-transport nccl
--update-weight-encoding indices # lowest compute, no compression
Receiver-side tuning (applies to both transports):
--sglang-update-weight-delta-chunk-bytes $((2 * 1024 * 1024 * 1024)) # byte cap per load_weights call
--sglang-update-weight-delta-read-workers 4 # parallel I/O threads (disk only)
See examples/delta_weight_sync/run-glm4.7-355B-A32B-delta.sh for a complete launcher.
How It Works#
Both transports share one sender pipeline, one wire layout, and one receiver-side decoder; only the per-flush carrier differs.
Sender (per sync, PP-source rank only):
Diff the current weights against the pinned-CPU snapshot via bytewise compare (
current.view(int_dtype) != snapshot.view(int_dtype)) — lossless, dtype-agnostic, no arithmetic.Encode changed (position, value) pairs into a packed
__positions__byte blob +__values__tensor + per-param decoding manifest. The encoding (indices,deltas,deltas_zstd) governs only how positions are packed; values are sent verbatim in the param’s dtype.Bucket per-chunk encodes up to
--update-weight-buffer-sizebytes, then flush:NCCL: broadcast
(__positions__, __values__)to the rollout engines with aDeltaSpec(encoding + per-param manifest) carried in the Ray RPC.Disk: write one safetensors file per flush under
weight_v{N:06d}/. Async background thread does the I/O + optional zstd compression off the critical path.
Snapshot the just-sent values via a D2H copy on a side stream so it overlaps with the next chunk’s encode.
End-of-sync (disk only): write a DONE marker, then rank 0 fires one HTTP push per engine and removes the directory after every engine acknowledges.
Receiver:
For both transports, the receiver ends up calling the same _apply_delta_payload(encoding, params, positions, values) helper. It decodes each param’s slice into a full-shape tensor with NaN at unchanged positions, then routes it through model.load_weights(...) under a _delta_apply_context that patches Tensor.copy_ / Tensor.fill_ to perform NaN-masked overwrite. Auxiliary writes (scratch buffers, fp8 scales, MoE biases via post_load_weights) keep their normal semantics.
Selective overwrite has no arithmetic — the receiver writes the trainer’s exact bytes at changed positions — so it’s lossless by construction and there’s no notion of drift to fight with periodic base re-syncs.
Encoding Choice#
--update-weight-encoding picks how positions are packed. All three share the same on-wire layout (__positions__ uint8 blob + __values__ tensor + per-param manifest); decoder dispatches on the metadata.
value |
positions |
when to pick |
|---|---|---|
|
int32 absolute positions (4 bytes / nnz) |
NCCL or fast intra-cluster FS (≥ ~600 MB/s) |
|
uint16 gap-deltas with uint32 fallback (~2 bytes / nnz at 2% density) |
medium FS bandwidth (~300-500 MB/s) |
|
|
cross-DC / cross-region shared FS (≤ ~300 MB/s) |
Why gap-encoded positions are smaller: positions come out of mask.nonzero() already sorted ascending. At density p, the expected gap between consecutive nonzero positions is 1/p, and P(gap > 65535) ≈ exp(-p · 65535). At p = 2% that’s effectively zero, so uint16 fits with a uint32 per-param fallback for pathological inputs. Half the position bytes of indices, lossless.
Break-even with indices at our density (~2%): deltas halves the positions blob (which dominates the wire); zstd shaves another ~35-40% on top by compressing the gap byte stream, at the cost of ~250ms/file compress + ~150ms/file decompress. The crossover with indices is where compress/decompress compute exceeds the bandwidth savings — empirically around 500 MB/s for deltas and 300 MB/s for deltas_zstd.
Why Not Colocated#
Colocated weight sync uses CUDA IPC: only a memory handle (~64 B) crosses processes. Delta encoding’s “bytes saved on the wire” benefit is zero, while the bookkeeping (snapshot + diff + sparse encode) is pure overhead. Slime rejects --update-weight-mode delta --colocate at argparse time.