# On-Policy Distillation Example
This example shows how to run on-policy distillation using Slime. A small student (Qwen3-8B) is aligned to imitate a larger teacher (Qwen3-32B) by training only on the student's own rollouts and matching the teacher's token-level log-probabilities.
In this example, the teacher acts as a reward model (RM): instead of a scalar score, it returns token-level log-probabilities that serve as the supervision signal.
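Concretely, the per-token signal is commonly the difference between teacher and student log-probabilities on the tokens the student actually sampled, which is a one-sample estimate of the negative reverse KL. Below is a minimal sketch, assuming aligned per-token log-probability tensors; the helper name is illustrative, not part of this example's code:

```python
import torch

def per_token_advantages(teacher_logprobs: torch.Tensor,
                         student_logprobs: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: per-token advantages on the student's own rollout.

    For each token the student sampled, (teacher logprob - student logprob)
    is a one-sample estimate of -KL(student || teacher) at that position.
    Using it as a per-token advantage pushes the student toward the teacher
    exactly on the states the student actually visits.
    """
    # Both tensors: shape [response_len], aligned over the response tokens.
    return teacher_logprobs - student_logprobs
```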
## Components
- `on_policy_distillation.py` implements:
  - `reward_func`, which calls the teacher server (via `args.rm_url`) for every sample to obtain token-level logprobs;
  - `post_process_rewards`, which trims the teacher logprobs to the generated response span and writes the tensors back to each `Sample` for computing advantages (see the sketch after this list).
- `run-qwen3-8B-opd.sh` launches an SGLang teacher server, then submits a Ray job that runs `train.py`.
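Roughly, the two hooks fit together as in the sketch below. The request/response fields and the `Sample` attributes are assumptions for illustration, not Slime's exact interfaces:

```python
import requests
import torch

def reward_func(args, sample):
    # Score the full prompt + response with the teacher server (args.rm_url).
    # The payload and response fields here are assumptions about the server.
    resp = requests.post(args.rm_url, json={"text": sample.prompt + sample.response})
    sample.teacher_logprobs = resp.json()["token_logprobs"]
    return 0.0  # scalar reward is unused; the per-token logprobs carry the signal

def post_process_rewards(args, samples):
    for sample in samples:
        # Keep only the logprobs that cover the generated response span.
        n = len(sample.response_tokens)
        teacher = torch.tensor(sample.teacher_logprobs[-n:])
        student = torch.tensor(sample.student_logprobs[-n:])
        # Per-token advantage: teacher logprob minus student logprob.
        sample.advantages = teacher - student
```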
## Running the example
Download or prepare the required checkpoints and data:

```bash
hf download Qwen/Qwen3-32B --local-dir /root/Qwen3-32B
hf download Qwen/Qwen3-8B --local-dir /root/Qwen3-8B
hf download zhuzilin/dapo-math-17k --local-dir /root/dapo-math-17k
```
Convert the student checkpoint from Hugging Face format to Megatron `torch_dist` format (adjust `--hf-checkpoint` if you downloaded the model to a different path):

```bash
source "${HOME_DIR}/slime/scripts/models/qwen3-8B.sh"
PYTHONPATH=/root/Megatron-LM:${HOME_DIR}/slime python tools/convert_hf_to_torch_dist.py \
    "${MODEL_ARGS[@]}" \
    --hf-checkpoint ${HOME_DIR}/checkpoints/Qwen/Qwen3-8B \
    --save ${HOME_DIR}/checkpoints/Qwen/Qwen3-8B_torch_dist
```
Run on-policy distillation:

```bash
bash examples/on_policy_distillation/run-qwen3-8B-opd.sh
```
## Preliminary Results
Starting from a Qwen3-8B-Base model fine-tuned (SFT) on part of the OpenThoughts3-1.2M dataset, we performed on-policy distillation with a Qwen3-32B teacher on the remaining data. Evaluation on Math500 shows:
| Model | Pass@1 |
|---|---|
| Qwen3-8B-Base + SFT | 76% |
| Qwen3-8B-Base + SFT + On-Policy Distillation | 94% |
## FAQ
**Why are teacher log-probabilities computed via an SGLang server instead of inside the training backend?** The teacher runs on an independent SGLang server that Slime treats as a reward model. Hosting it inside Megatron/FSDP would require maintaining a second, fully configured training stack just for the teacher.
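For reference, a scoring request against the teacher's SGLang server looks roughly like the sketch below. The host/port and the exact request and response fields are assumptions based on SGLang's native `/generate` API and may differ across versions:

```python
import requests

# Hypothetical scoring call: with max_new_tokens=0 the server generates
# nothing and only prefills the prompt; return_logprob with
# logprob_start_len=0 requests per-token logprobs over the whole input.
payload = {
    "text": "The capital of France is Paris.",
    "sampling_params": {"max_new_tokens": 0, "temperature": 0.0},
    "return_logprob": True,
    "logprob_start_len": 0,
}
resp = requests.post("http://localhost:30000/generate", json=payload)
# Each entry pairs a logprob with its token id; the first entry may be None
# because the first token has no preceding context to condition on.
token_logprobs = resp.json()["meta_info"]["input_token_logprobs"]
print(token_logprobs[:5])
```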
## References
- https://thinkingmachines.ai/blog/on-policy-distillation/
- https://arxiv.org/abs/2306.13649
- https://arxiv.org/abs/2306.08543