Coding-Agent RL
This directory provides an example of running end-to-end SWE (software engineering) coding-agent RL with slime. A real coding agent (the claude-code CLI) drives Read/Edit/Grep/Bash/Agent tools inside a fresh sandbox per sample, the model produces a git diff, and the diff is graded against the dataset’s test harness in a second, clean sandbox, which prevents test-cheating.
Three files implement the loop:
- generate.py: per-sample generate() registered via --custom-generate-function-path. Boots the sandbox, runs claude-code, captures the diff, scores it, and emits one or more Sample objects back to slime.
- middleware.py: Anthropic Messages API ↔ SGLang /generate shim. claude-code talks to it as if it were Anthropic; the shim tokenizes the current message history each turn, records prompt/output token snapshots, preserves model-generated tokens (loss_mask=1) only while later prompts stitch onto them, masks template/observation tokens (0), and emits three kinds of segments per trajectory: subagent (completed Task/Agent dispatch), wipe (chain frozen by auto-compact), and final (tail of the main chain).
- sandbox.py: coding-agent/SWE helpers built on slime.agent.sandbox: install bootstraps, spawn claude-code, capture patches, and run the fresh-sandbox evaluator. The shared sandbox contract lives in slime.agent.sandbox.Sandbox.
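The masking rule described for middleware.py can be illustrated with a minimal sketch. Span and build_loss_mask are hypothetical names used only for illustration, not the shim's actual code, and the sketch ignores the stitching check: it simply assumes each run of tokens is already tagged as model-generated or not.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """A run of token ids plus who produced it (hypothetical type)."""
    token_ids: list[int]
    from_model: bool  # True for sampled tokens, False for template/observation


def build_loss_mask(spans: list[Span]) -> tuple[list[int], list[int]]:
    """Flatten spans into (tokens, loss_mask): 1 = model-generated, 0 = masked."""
    tokens: list[int] = []
    mask: list[int] = []
    for span in spans:
        tokens.extend(span.token_ids)
        mask.extend([1 if span.from_model else 0] * len(span.token_ids))
    return tokens, mask


spans = [
    Span([101, 102], from_model=False),  # chat template + user prompt
    Span([7, 8, 9], from_model=True),    # assistant tool call
    Span([201], from_model=False),       # tool observation
]
tokens, mask = build_loss_mask(spans)
```

Only the positions with mask value 1 would contribute to the training loss; template and observation tokens stay in the sequence as context.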
Environment Setup
The slime training stack itself follows the standard setup. On top of that you need:
- An E2B-compatible sandbox cluster (or any provider that speaks the E2B SDK). Configure via E2B_API_KEY (e.g. the standard e2b_xxx key from https://e2b.dev, or any internal endpoint that accepts the same SDK).
- Host-side tarballs that get uploaded into each sandbox at boot:
  - Node 22 (node-v22.x-linux-x64.tar.xz), exported as SWE_HOST_NODE_TARBALL.
  - Claude Code CLI npm tarball (anthropic-ai-claude-code-local-linux-x64.tgz), exported as SWE_HOST_CC_TARBALL.
- A sandbox metadata file (SWE_SANDBOX_METADATA_FILE, or the generic SLIME_AGENT_SANDBOX_METADATA_FILE): a JSON dict whose keys are passed as routing tags when booting an E2B sandbox. It must contain the image key referenced by SWE_SANDBOX_IMAGE_METADATA_KEY / SLIME_AGENT_SANDBOX_IMAGE_METADATA_KEY (e.g. image).
- Network reachability: each sandbox dials back to the slime head node’s middleware over http://${SLIME_HEAD_HOST}:${SHIM_PORT}. The head host must be reachable from inside the sandboxes (set SLIME_HEAD_HOST to a routable IP, not 127.0.0.1).
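Two of the requirements above are easy to get wrong and cheap to check before launching. The following is a minimal preflight sketch; check_head_host and check_metadata are hypothetical helpers, not part of slime, and they only verify the two constraints stated above (a non-loopback head host and a metadata dict that contains the image key).

```python
import ipaddress
import json


def check_head_host(host: str) -> None:
    """Reject loopback addresses, which sandboxes cannot dial back to."""
    if host == "localhost":
        raise ValueError("SLIME_HEAD_HOST must be routable from the sandboxes")
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return  # a hostname; DNS reachability is not checked here
    if addr.is_loopback or addr.is_unspecified:
        raise ValueError(f"SLIME_HEAD_HOST={host} is not routable from a sandbox")


def check_metadata(metadata_json: str, image_key: str) -> dict:
    """Parse the sandbox metadata JSON and require the image key."""
    metadata = json.loads(metadata_json)
    if not isinstance(metadata, dict):
        raise ValueError("sandbox metadata must be a JSON object")
    if image_key not in metadata:
        raise ValueError(f"sandbox metadata is missing the {image_key!r} key")
    return metadata
```

In practice the arguments would come from the environment, for example the value of SLIME_HEAD_HOST and the contents of the file named by SWE_SANDBOX_METADATA_FILE.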
Dataset Format
Standard slime JSONL with three keys:
{
  "prompt": "<falls back here if metadata.problem_statement is missing>",
  "label": "<instance_id or grader label>",
  "metadata": {
    "image": "swedev/scaleswe.oh.34:<tag>",  // sandbox image reference
    "workdir": "/workspace/<repo>",           // repo path inside the sandbox
    "problem_statement": "<issue body>",
    // exactly one of the following two graders:
    "swepro": { /* SWE-bench Pro test harness (preferred) */ },
    "eval_cmd": "pytest -x tests/..."         // last-resort: exit 0 = solved
    // sweb-style rows: metadata.remote_env_info.f2p_script (Python file
    // ending in `sys.exit(pytest.main(...))`) is auto-wrapped into eval_cmd.
  }
}
Wire it up with --input-key prompt --label-key label --metadata-key metadata.
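The row layout above implies two small rules: the prompt falls back to the top-level key when metadata.problem_statement is missing, and each row carries exactly one grader. A minimal sketch of checking both follows; resolve_prompt and grader_kind are hypothetical helpers rather than slime functions, the sample row uses placeholder values, and sweb-style rows that rely on remote_env_info.f2p_script are deliberately not handled.

```python
import json


def resolve_prompt(row: dict) -> str:
    """Prefer metadata.problem_statement; fall back to the top-level prompt."""
    metadata = row.get("metadata") or {}
    return metadata.get("problem_statement") or row["prompt"]


def grader_kind(row: dict) -> str:
    """Return which grader a row carries, requiring exactly one."""
    metadata = row.get("metadata") or {}
    present = [key for key in ("swepro", "eval_cmd") if key in metadata]
    if len(present) != 1:
        raise ValueError(f"expected exactly one grader, found {present}")
    return present[0]


# One JSONL line with placeholder values, round-tripped through json.
line = json.dumps({
    "prompt": "fallback prompt",
    "label": "demo-instance",
    "metadata": {
        "image": "example/image:tag",
        "workdir": "/workspace/demo",
        "problem_statement": "Fix the off-by-one error in pagination.",
        "eval_cmd": "pytest -x tests/",
    },
})
row = json.loads(line)
prompt = resolve_prompt(row)
kind = grader_kind(row)
```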
Running the Script
Override the paths at the top of the launcher, then run from a long-lived shell on the Ray head node (do not wrap in nohup — Ray child processes get cleaned up with it):
cd slime/
export HF_CHECKPOINT=/path/to/Qwen3.6-35B-A3B
export REF_MODEL_PATH=/path/to/Qwen3.6-35B-A3B_torch_dist
export PROMPT_DATA=/path/to/swe_train.jsonl
export SANDBOX_METADATA_FILE=/path/to/sandbox_metadata.json
export SWE_HOST_NODE_TARBALL=/path/to/node-v22.20.0-linux-x64.tar.xz
export SWE_HOST_CC_TARBALL=/path/to/anthropic-ai-claude-code-local-linux-x64.tgz
bash examples/coding_agent_rl/run_qwen36_35b_a3b_swe_8nodes.sh
The launcher brings up Ray across all hosts in /root/mpi_rack_hostfile, dumps every rollout to runs/${EXP_TAG}_${STAMP}/rollout_dumps/, and tees stdout into runs/${EXP_TAG}_${STAMP}/run.log.
New Arguments
generate.py is wired in through slime’s standard custom-generate hook:
ROLLOUT_ARGS=(
   --custom-generate-function-path examples.coding_agent_rl.generate.generate
   --prompt-data "${PROMPT_DATA}"
   --input-key prompt
   --label-key label
   --metadata-key metadata
   --rollout-batch-size 8
   --n-samples-per-prompt 8
   --rollout-max-context-len 96000
   --rollout-max-response-len 32768
   --rollout-stop-token-ids 248046 248044
   --save-debug-rollout-data "${RUN_ROOT}/rollout_dumps/rollout_{rollout_id}.pt"
)
The SGLang server must expose Qwen3.6’s tool-call and reasoning parsers so claude-code’s tool invocations are parsed correctly:
SGLANG_ARGS=(
   --sglang-tool-call-parser qwen3_coder
   --sglang-reasoning-parser qwen3
   ...
)
SWE-specific Environment Knobs
All set in the launcher; tune per cluster.
| Variable | Default | Meaning |
|---|---|---|
| SLIME_HEAD_HOST | | Public IP the sandbox uses to reach the middleware. Must be routable from inside the sandbox. |
| | | Bind address of the middleware shim on the head node. |
| E2B_API_KEY | - | E2B (or compatible) API key. |
| SWE_SANDBOX_METADATA_FILE | - | JSON dict of routing metadata passed at sandbox boot. |
| SWE_SANDBOX_IMAGE_METADATA_KEY | - | Which key in the metadata file holds the image reference (e.g. image). |
| SWE_HOST_NODE_TARBALL | - | Host path to Node 22 tarball uploaded into each sandbox. |
| SWE_HOST_CC_TARBALL | - | Host path to the Claude Code CLI npm tarball. |
| | | Wallclock budget for one agent run. |
| | | Wallclock cap on the evaluator sandbox. |
| | | Cap on simultaneous sandbox boots (eases h2/SSL long-tail). |
| | (see launcher) | Extra flags appended to the claude-code invocation. |
| | unset | Optional override for the user-turn prompt. Setting this to require sub-agent dispatch is the most reliable way to maximize fan-out. |
--rollout-max-response-len is the per-turn generation cap passed to each
SGLang /generate call as max_new_tokens. --rollout-max-context-len is the
multi-turn prompt+response budget: each turn clamps max_new_tokens to the
remaining context, and oversized emitted segments are dropped before training.
The middleware reuses --sglang-tool-call-parser and
--sglang-reasoning-parser for output parsing, so those flags must match the
served model.
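The interaction between the two length flags can be made concrete with a short sketch. The function names are hypothetical and the exact boundary handling in the middleware (for instance what happens when no context remains, or whether the size check is inclusive) is an assumption; only the clamping rule itself comes from the description above.

```python
def clamp_max_new_tokens(prompt_len: int, max_response_len: int, max_context_len: int) -> int:
    """Per-turn cap: the response limit, bounded by what is left of the context."""
    remaining = max_context_len - prompt_len
    return max(0, min(max_response_len, remaining))


def fits_context(segment_len: int, max_context_len: int) -> bool:
    """Segments longer than the context budget would be dropped before training."""
    return segment_len <= max_context_len


# Values from the ROLLOUT_ARGS example: 96000 context, 32768 per-turn response.
early_turn = clamp_max_new_tokens(10_000, 32_768, 96_000)   # full per-turn cap applies
late_turn = clamp_max_new_tokens(80_000, 32_768, 96_000)    # only 16000 tokens remain
exhausted = clamp_max_new_tokens(97_000, 32_768, 96_000)    # nothing left to generate
```

With these settings an early turn may generate up to 32768 tokens, while a turn whose prompt already occupies 80000 tokens is limited to the remaining 16000.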
Fan-out Semantics
- generate() returns list[Sample]: one Sample per trajectory segment (subagent / wipe / final).
- Per-trajectory reward is split as reward / K across segments; rollout_id is shared so the per-rollout-mean loss reducer still counts the trajectory once.
- Sub-agent dispatch increases K (each completed Agent turn block becomes its own segment), so the effective batch after flatten can be much larger than rollout_batch_size * n_samples_per_prompt.
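The reward split can be sketched in a few lines. Segment and split_reward are hypothetical stand-ins for illustration (the real code emits slime Sample objects); the sketch only shows the reward / K rule and the shared rollout_id.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for one emitted Sample."""
    kind: str        # "subagent", "wipe", or "final"
    rollout_id: int  # shared by every segment of the same trajectory
    reward: float = 0.0


def split_reward(segments: list[Segment], trajectory_reward: float) -> list[Segment]:
    """Assign each of the K segments an equal share, reward / K."""
    k = len(segments)
    if k == 0:
        raise ValueError("a trajectory must emit at least one segment")
    for segment in segments:
        segment.reward = trajectory_reward / k
    return segments


# One trajectory with two sub-agent dispatches, one auto-compact wipe, and the tail.
segments = [
    Segment("subagent", rollout_id=7),
    Segment("subagent", rollout_id=7),
    Segment("wipe", rollout_id=7),
    Segment("final", rollout_id=7),
]
split_reward(segments, trajectory_reward=1.0)
```

Here K is 4, so each segment receives 0.25 and the shares sum back to the trajectory reward, while the common rollout_id lets a per-rollout reducer group the four segments as one trajectory.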
Porting to a New Sandbox Backend
slime.agent.sandbox.Sandbox exposes the shared sandbox contract, and
slime.agent.sandbox.E2BSandbox is the E2B implementation:
await sb.exec(cmd, user=..., check=..., timeout=...)
await sb.write_file(sandbox_path, content_or_host_path, user=...)
await sb.read_file(sandbox_path, user=...)
async with E2BSandbox(...) as sb: ...
Reimplement those on Docker / Modal / a local VM and everything in generate.py and middleware.py keeps working unchanged.
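To give a sense of the porting effort, here is a minimal sketch of a backend that maps the four calls above onto a temporary local directory. LocalSandbox is hypothetical and is not part of slime; the return shape of exec, the handling of user (ignored here), and support for content only (not a host path) in write_file are all assumptions, so the real Sandbox contract should be checked before relying on any of them. It also provides no isolation and assumes a POSIX shell.

```python
import asyncio
import tempfile
from pathlib import Path


class LocalSandbox:
    """Hypothetical local-directory backend mirroring the calls listed above."""

    def __init__(self) -> None:
        self._tmp = None
        self.root = None

    async def __aenter__(self):
        self._tmp = tempfile.TemporaryDirectory()
        self.root = Path(self._tmp.name)
        return self

    async def __aexit__(self, *exc_info) -> None:
        self._tmp.cleanup()

    def _resolve(self, sandbox_path: str) -> Path:
        # Map an absolute sandbox path onto the temporary root.
        return self.root / sandbox_path.lstrip("/")

    async def exec(self, cmd, user=None, check=True, timeout=None):
        # `user` is accepted for signature compatibility but ignored here.
        proc = await asyncio.create_subprocess_shell(
            cmd,
            cwd=self.root,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        try:
            out, err = await asyncio.wait_for(proc.communicate(), timeout)
        except asyncio.TimeoutError:
            proc.kill()
            await proc.wait()
            raise
        if check and proc.returncode != 0:
            raise RuntimeError(f"{cmd!r} exited with {proc.returncode}: {err.decode()}")
        return proc.returncode, out.decode(), err.decode()

    async def write_file(self, sandbox_path, content, user=None):
        # Only str/bytes content is handled in this sketch.
        target = self._resolve(sandbox_path)
        target.parent.mkdir(parents=True, exist_ok=True)
        data = content.encode() if isinstance(content, str) else content
        target.write_bytes(data)

    async def read_file(self, sandbox_path, user=None):
        return self._resolve(sandbox_path).read_bytes()


async def demo():
    async with LocalSandbox() as sb:
        await sb.write_file("/workspace/hello.txt", "hi\n")
        code, out, _ = await sb.exec("cat workspace/hello.txt")
        data = await sb.read_file("/workspace/hello.txt")
        return code, out, data


result = asyncio.run(demo())
```

A Docker or Modal port would follow the same shape, replacing the subprocess and filesystem calls with the provider's exec and file-transfer primitives.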