Agentic RL Training Roadmap

slime is not limited to single-turn RL. Its main advantage for agentic training is the combination of high-performance training, SGLang rollout serving, and pluggable data-generation interfaces. This makes it suitable for multi-turn tool use, sandbox interaction, subagent branches, context compaction, and test-based rewards.

This page is a roadmap: use it to decide which docs and examples to read when plugging an agent workflow into slime.

Where To Start

| Goal | Recommended entry point |
| --- | --- |
| Run a custom agent loop, tool calls, RAG, or browser/terminal/sandbox interaction for each sample | --custom-generate-function-path, writing a custom generation function |
| Implement verifier rewards, test-based rewards, environment success checks, or an external reward service | --custom-rm-path, writing a custom reward function |
| Return multiple training samples from one prompt, such as subagent, multi-agent, or context-compaction segments | fan-out return from custom generate, examples/multi_agent |
| Avoid blocking training on long-tail agent rollouts | examples/fully_async |
| Study a full end-to-end agent example with sandboxing, real code edits, and test-based grading | examples/coding_agent_rl |
| Improve SGLang serving throughput for multi-turn agents | PD Disaggregation, SGLang Config |
| Enable SGLang optimization flags, router policies, or multi-model serving | How to Use SGLang, SGLang Config, Speculative Decoding, Low Precision Training |
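To make the first two goals above more concrete, here is a minimal sketch of the general shape a custom generation function and a custom reward function might take. Everything in it is an assumption for illustration: the `Sample` dataclass is a stand-in rather than slime's own type, `fake_policy` and `run_tool` are hypothetical placeholders for a rollout-server request and a tool executor, `args` is a plain dict here, and the `(args, sample, sampling_params)` and `(args, sample)` signatures should be checked against the custom generation and custom reward function docs before use.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Sample:
    """Stand-in for slime's sample type; the field names are assumptions."""
    prompt: str
    response: str = ""
    reward: Optional[float] = None
    metadata: dict = field(default_factory=dict)


async def fake_policy(context: str) -> str:
    """Hypothetical placeholder for a request to the rollout server."""
    # Ask for a tool once, then answer after the tool result is in context.
    if "[tool result]" not in context:
        return "CALL calculator 6*7"
    return "ANSWER 42"


async def run_tool(call: str) -> str:
    """Hypothetical tool executor that supports a single multiplication."""
    _, _, expr = call.split(" ", 2)
    left, right = expr.split("*")
    return str(int(left) * int(right))


async def generate(args, sample: Sample, sampling_params: dict) -> Sample:
    """Multi-turn loop: alternate model turns and tool calls until an answer."""
    context = sample.prompt
    for _ in range(args["max_turns"]):
        turn = await fake_policy(context)
        context += "\n" + turn
        if turn.startswith("ANSWER"):
            break
        context += "\n[tool result] " + await run_tool(turn)
    # Everything after the prompt is treated as the response.
    sample.response = context[len(sample.prompt):].strip()
    return sample


async def custom_rm(args, sample: Sample) -> float:
    """Verifier-style reward: 1.0 if the final line matches the label."""
    final_line = sample.response.splitlines()[-1]
    return 1.0 if final_line == f"ANSWER {sample.metadata['label']}" else 0.0


async def demo() -> Sample:
    sample = Sample(prompt="What is 6*7?", metadata={"label": "42"})
    sample = await generate({"max_turns": 4}, sample, sampling_params={})
    sample.reward = await custom_rm({}, sample)
    return sample
```

Running `asyncio.run(demo())` walks through one tool call followed by a final answer. In a real setup the policy call would go to the SGLang rollout server, and the tool executor would be whatever environment the agent interacts with.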

Agent Serving And Performance

Agentic rollouts tend to depend more heavily on serving configuration than ordinary single-turn generation: contexts are longer, requests are multi-turn, latency has a heavier tail, and the workflow may need actor, reference, reward, or tool-side models at the same time.

  • Regular SGLang server arguments are passed as --sglang-*. For example, SGLang’s --context-length becomes --sglang-context-length, and --mem-fraction-static becomes --sglang-mem-fraction-static.

  • Router arguments are passed as --router-*. For multi-turn agents, consider --router-policy consistent_hashing so that requests sharing the same sample.session_id are routed to the same worker, which improves the prefix-cache hit rate. See Session-Affinity Routing for Multi-Turn Agents.

  • Use --sglang-config for more complex topologies: PD disaggregation, multi-model serving, heterogeneous server groups, and per-group SGLang overrides.

  • For multi-turn or agentic RL, evaluate PD (prefill/decode) disaggregation. Prefill and decode have different workload shapes, and separating them makes it easier to scale each one independently.

  • For rollout-throughput optimization, also see Speculative Decoding and Low Precision Training.
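The prefixing convention in the list above can be sketched as a small helper. `to_slime_flags` is a hypothetical function written for illustration, not part of slime, and the values 32768 and 0.8 are arbitrary examples rather than recommended settings.

```python
def to_slime_flags(sglang_args: dict, router_args: dict) -> list:
    """Prefix SGLang server options with --sglang- and router options with --router-."""
    flags = []
    for prefix, options in (("sglang", sglang_args), ("router", router_args)):
        for name, value in options.items():
            # Accept names with or without leading dashes.
            flags += [f"--{prefix}-{name.lstrip('-')}", str(value)]
    return flags


flags = to_slime_flags(
    {"--context-length": 32768, "--mem-fraction-static": 0.8},
    {"policy": "consistent_hashing"},
)
print(" ".join(flags))
```

The printed line shows how SGLang's `--context-length` and `--mem-fraction-static` appear with the `--sglang-` prefix, alongside the `--router-policy consistent_hashing` setting mentioned above.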

Reference Example

The full coding-agent example is examples/coding_agent_rl. It shows an end-to-end agent RL setup that is close to a real software-engineering workflow: each sample boots an isolated sandbox, the agent uses tools to edit code, the rollout captures a git diff, and a clean sandbox runs the tests to produce the reward.

This example also demonstrates agent fan-out training. Its middleware splits one trajectory into subagent, wipe (the chain frozen before compaction), and final segments. generate() returns list[Sample], and all segments share the same rollout_id.
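A minimal sketch of the fan-out shape, under stated assumptions: the `Sample` dataclass below is a stand-in with made-up field names, `fan_out` is a hypothetical helper, and giving every segment the same reward is an illustrative choice rather than something the example is known to do.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Stand-in for slime's sample type; the field names are assumptions."""
    rollout_id: int
    kind: str  # "subagent", "wipe", or "final"
    text: str
    reward: float = 0.0


def fan_out(rollout_id: int, segments: list, reward: float) -> list:
    """Turn one trajectory's (kind, text) segments into samples sharing one rollout_id."""
    return [
        Sample(rollout_id=rollout_id, kind=kind, text=text, reward=reward)
        for kind, text in segments
    ]


samples = fan_out(
    rollout_id=7,
    segments=[
        ("subagent", "search the repo for the failing test"),
        ("wipe", "context frozen before compaction"),
        ("final", "apply the patch and finish"),
    ],
    reward=1.0,
)
```

Each returned sample is a separate training example, and the shared `rollout_id` records that all of them came from the same trajectory.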

For smaller starting points, see examples/search-r1 for multi-turn tool use, examples/retool for tool-augmented generation, and examples/multi_agent for the multi-agent pattern.