PD Disaggregation

PD Disaggregation runs Prefill and Decode on separate workers during SGLang rollout. It is most useful for multi-turn, long-context, and agentic RL workloads, where prompt processing and token generation have very different compute and memory profiles.

When to Use

Use PD Disaggregation when:

  • rollout contexts are long or grow across turns;

  • decode dominates rollout time;

  • prefix-cache locality matters for multi-turn sessions;

  • prefill and decode need different TP, memory, or runtime settings;

  • you want an SGLang serving topology that is closer to production serving than a single uniform inference group.

For short single-turn tasks, the default (regular) SGLang engine layout is usually simpler.

Configuration Paths

slime supports two ways to configure PD.

Simple Path: --prefill-num-servers

For a single actor model with a simple PD layout, set:

--prefill-num-servers 1

This is the lightweight path used by simple scripts. It is convenient when you only need to split prefill/decode without tuning each group separately.

Advanced Path: --sglang-config

For production rollout topologies, use SGLang Config. It lets you configure prefill and decode groups independently, and can also express EPD-style layouts, heterogeneous server groups, multi-model serving, and per-group SGLang overrides.

Example (the launch command below assumes it is saved as sglang_pd.yaml):

sglang:
  - name: actor
    update_weights: true
    server_groups:
      - worker_type: prefill
        num_gpus: 4
        num_gpus_per_engine: 2
        overrides:
          chunked_prefill_size: 8192
      - worker_type: decode
        num_gpus: 12
        num_gpus_per_engine: 4
        overrides:
          mem_fraction_static: 0.88

Launch with:

python train.py \
  --sglang-config sglang_pd.yaml \
  --rollout-num-gpus 16 \
  ...
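
The GPU arithmetic behind this launch command can be checked with a few lines of Python. The sketch below mirrors the example config as a plain dictionary instead of parsing the YAML file. It assumes that each server group runs num_gpus / num_gpus_per_engine engines and that --rollout-num-gpus should equal the sum of num_gpus over all groups. The helper name summarize_groups is hypothetical and is not part of slime or SGLang.

```python
# Plain-dict mirror of the example config above (not loaded from YAML).
config = {
    "sglang": [
        {
            "name": "actor",
            "update_weights": True,
            "server_groups": [
                {"worker_type": "prefill", "num_gpus": 4, "num_gpus_per_engine": 2},
                {"worker_type": "decode", "num_gpus": 12, "num_gpus_per_engine": 4},
            ],
        }
    ]
}


def summarize_groups(cfg):
    """Hypothetical helper: return (total_gpus, engine count per worker type)."""
    total_gpus = 0
    engines = {}
    for model in cfg["sglang"]:
        for group in model["server_groups"]:
            gpus = group["num_gpus"]
            per_engine = group["num_gpus_per_engine"]
            # Assumption: a group must split evenly into whole engines.
            if gpus % per_engine != 0:
                raise ValueError(
                    f"{group['worker_type']}: {gpus} GPUs is not a multiple of {per_engine}"
                )
            total_gpus += gpus
            worker_type = group["worker_type"]
            engines[worker_type] = engines.get(worker_type, 0) + gpus // per_engine
    return total_gpus, engines


total, engines = summarize_groups(config)
print(total)    # 16, the value passed to --rollout-num-gpus
print(engines)  # {'prefill': 2, 'decode': 3}
```

Under these assumptions the example describes two prefill engines with 2 GPUs each and three decode engines with 4 GPUs each.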

Why This Matters for RL

RL rollout is often not a uniform batch of short completions. Agentic and verifier-based workloads commonly have:

  • long prompts from tool/environment history;

  • multiple turns per sample;

  • long-tail decode latency;

  • session-local prefix cache opportunities;

  • different resource needs for actor, reference, reward, or judge models.

PD lets slime keep the training loop unchanged while using a rollout topology that matches the actual serving workload.

Operational Notes

  • For new complex deployments, prefer --sglang-config over --prefill-num-servers.

  • Use router session affinity for multi-turn agents so turns from the same sample can reuse prefix cache. See Session-Affinity Routing.

  • Keep --rollout-num-gpus equal to the total number of GPUs described by the SGLang config. In the example above, that is 4 prefill + 12 decode = 16.

  • Do not mix regular workers with prefill/decode workers inside the same model entry.

  • Tune prefill and decode TP separately when prompt processing and token generation have different bottlenecks.
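
The rule against mixing worker types can be checked before launch. The following is a minimal sketch, not slime's own validation: it works on a plain dictionary shaped like the SGLang config, and it assumes that regular workers are marked with worker_type "regular" and that a group without worker_type counts as regular. The function name find_mixed_entries is hypothetical.

```python
PD_TYPES = {"prefill", "decode"}


def find_mixed_entries(cfg):
    """Return names of model entries that mix regular and prefill/decode groups."""
    mixed = []
    for model in cfg["sglang"]:
        # Assumption: a missing worker_type means a regular worker.
        types = {g.get("worker_type", "regular") for g in model["server_groups"]}
        if "regular" in types and types & PD_TYPES:
            mixed.append(model["name"])
    return mixed


ok_config = {
    "sglang": [
        {
            "name": "actor",
            "server_groups": [
                {"worker_type": "prefill", "num_gpus": 4},
                {"worker_type": "decode", "num_gpus": 12},
            ],
        }
    ]
}

bad_config = {
    "sglang": [
        {
            "name": "actor",
            "server_groups": [
                {"worker_type": "regular", "num_gpus": 4},
                {"worker_type": "decode", "num_gpus": 12},
            ],
        }
    ]
}

print(find_mixed_entries(ok_config))   # []
print(find_mixed_entries(bad_config))  # ['actor']
```

The check is per model entry, so a config with a PD-only entry and a separate regular-only entry passes.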