# On-Policy Distillation
On-policy distillation (OPD) enables a student model to learn from a larger teacher model by training on its own rollouts while matching the teacher’s token-level log-probabilities. OPD is orthogonal to advantage estimators — it works as an additive KL penalty on top of any estimator (GRPO, PPO, REINFORCE++, etc.).
## Key Arguments
| Argument | Description |
|---|---|
| `--use-opd` | Enable on-policy distillation. Required flag to use OPD. |
| `--opd-type` | Type of OPD: `sglang` or `megatron`. |
| `--opd-kl-coef` | OPD KL penalty coefficient (default: 1.0). Controls the weight of the distillation signal relative to the RL advantage. |
| `--opd-teacher-load` | Path to teacher Megatron checkpoint. Required when `--opd-type megatron`. |
| | Optional checkpoint step for teacher model. |
## How It Works
OPD modifies the advantage computation by subtracting a KL penalty term that encourages the student to match the teacher’s output distribution:

\[
A_t^{\text{OPD}} = A_t - \lambda_{\text{opd}} \, D_{\text{KL}}\!\left(\pi_\theta(\cdot \mid s_t) \,\middle\|\, \pi_{\text{teacher}}(\cdot \mid s_t)\right)
\]

Where \(A_t\) is the original advantage from the base estimator (e.g., GRPO), \(\lambda_{\text{opd}}\) is `--opd-kl-coef`, and \(D_{\text{KL}}\) is the token-level reverse KL divergence from the student policy \(\pi_\theta\) to the teacher policy \(\pi_{\text{teacher}}\).
This means OPD can be combined with any advantage estimator, including GRPO, PPO, REINFORCE++, and GSPO.
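As a minimal sketch (not slime’s actual code; the function and variable names are illustrative), the per-token adjustment can be written using the standard single-sample reverse-KL estimate on the rollout tokens, \(\log \pi_\theta(y_t) - \log \pi_{\text{teacher}}(y_t)\):

```python
def opd_adjust_advantages(advantages, student_log_probs, teacher_log_probs,
                          opd_kl_coef=1.0):
    """Subtract the OPD KL penalty from per-token advantages.

    On the sampled rollout tokens, the reverse KL D_KL(student || teacher)
    is estimated per token as log pi_student(y_t) - log pi_teacher(y_t).
    Illustrative sketch only; list shapes stand in for real tensors.
    """
    adjusted = []
    for a, s, t in zip(advantages, student_log_probs, teacher_log_probs):
        kl_est = s - t  # per-token reverse-KL estimate on the sampled token
        adjusted.append(a - opd_kl_coef * kl_est)
    return adjusted
```

Tokens where the student assigns higher probability than the teacher get a positive KL estimate and hence a reduced advantage; tokens where the teacher is more confident get a boosted advantage, pulling the student toward the teacher’s distribution.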
## Two Teacher Modes

### SGLang Mode (`--opd-type sglang`)
The teacher runs on an external SGLang server. Teacher log-probs are obtained during the rollout phase.
When to use: The teacher has a different architecture from the student, or the teacher is too large to load alongside the training model.
How it works:

1. An external SGLang server runs the teacher model.
2. During rollout, the custom reward function (`slime.rollout.on_policy_distillation.reward_func`) sends each sample to the teacher server to obtain token-level log-probs.
3. The custom post-processing function (`slime.rollout.on_policy_distillation.post_process_rewards`) trims the teacher log-probs to the response span and stores them in `sample.teacher_log_probs`.
4. During training, the KL penalty is computed from the stored teacher log-probs and applied to advantages.
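The trimming in step 3 can be sketched as follows. This is a minimal illustration; the `Sample` stand-in and its field names are assumptions, not slime’s real rollout types:

```python
def trim_teacher_log_probs(full_log_probs, prompt_len, response_len):
    """Keep only the response-span entries of the teacher's token log-probs.

    Assumes `full_log_probs` covers the whole prompt+response sequence,
    one entry per token (layout is illustrative, not slime's actual one).
    """
    return full_log_probs[prompt_len : prompt_len + response_len]


class Sample:
    """Minimal stand-in for slime's rollout sample object (assumed fields)."""
    def __init__(self, prompt_len, response_len):
        self.prompt_len = prompt_len
        self.response_len = response_len
        self.teacher_log_probs = None


def post_process(sample, full_log_probs):
    # Store the response-span teacher log-probs on the sample for training.
    sample.teacher_log_probs = trim_teacher_log_probs(
        full_log_probs, sample.prompt_len, sample.response_len
    )
    return sample
```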
Configuration:

```bash
--use-opd
--opd-type sglang
--opd-kl-coef 1.0
--custom-rm-path slime.rollout.on_policy_distillation.reward_func
--custom-reward-post-process-path slime.rollout.on_policy_distillation.post_process_rewards
--rm-url http://<TEACHER_IP>:<TEACHER_PORT>/generate
```
### Megatron Mode (`--opd-type megatron`)

The teacher model is loaded directly into Megatron via `--opd-teacher-load`. Teacher log-probs are computed during the training forward pass.
When to use: The teacher has the same architecture as the student/reference model and fits in GPU memory.
How it works:

1. The teacher model is loaded as an additional Megatron model during initialization.
2. During the training forward pass, the teacher model computes log-probs for each sample.
3. The KL penalty is computed inline and applied to advantages.
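A dependency-free sketch of where the inline penalty fits in a training step. The real implementation runs a second Megatron model (with no gradient tracking for the teacher); all names below are illustrative:

```python
def training_step_with_opd(batch, student_log_prob_fn, teacher_log_prob_fn,
                           opd_kl_coef=1.0):
    """Compute OPD-adjusted advantages during the training forward pass.

    `teacher_log_prob_fn` stands in for the extra teacher forward pass;
    in real code it runs under no-grad on the co-located teacher model.
    """
    student_lp = student_log_prob_fn(batch["tokens"])
    teacher_lp = teacher_log_prob_fn(batch["tokens"])
    return [
        a - opd_kl_coef * (s - t)  # inline per-token reverse-KL penalty
        for a, s, t in zip(batch["advantages"], student_lp, teacher_lp)
    ]
```

Compared to SGLang mode, no teacher log-probs need to be stored on the sample: both forward passes happen in the same training step.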
Configuration:

```bash
--use-opd
--opd-type megatron
--opd-kl-coef 1.0
--opd-teacher-load /path/to/teacher_torch_dist
```
Note: The teacher checkpoint must be in Megatron format (`torch_dist` or `torch`). You can convert from HuggingFace format using `tools/convert_hf_to_torch_dist.py`.
## Running the Examples
Complete example scripts are provided in examples/on_policy_distillation/:
### SGLang Teacher
```bash
# 1. Download models and data
hf download Qwen/Qwen3-32B --local-dir /root/Qwen3-32B
hf download Qwen/Qwen3-8B --local-dir /root/Qwen3-8B
hf download --repo-type dataset zhuzilin/dapo-math-17k --local-dir /root/dapo-math-17k

# 2. Convert student model
cd /root/slime
source scripts/models/qwen3-8B.sh
PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
    ${MODEL_ARGS[@]} \
    --hf-checkpoint /root/Qwen3-8B \
    --save /root/Qwen3-8B_torch_dist

# 3. Run
bash examples/on_policy_distillation/run-qwen3-8B-opd.sh
```
### Megatron Teacher
```bash
# 1. Convert both student and teacher models to Megatron format
# 2. Run
bash examples/on_policy_distillation/run-qwen3-8B-opd-megatron.sh
```
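For step 1, the student conversion is the same as in the SGLang example above. The teacher conversion might look like the following, assuming a `scripts/models/qwen3-32B.sh` model-args script exists alongside the 8B one (that path is an assumption, not confirmed by this document):

```bash
cd /root/slime
source scripts/models/qwen3-32B.sh  # assumed analogue of qwen3-8B.sh
PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
    ${MODEL_ARGS[@]} \
    --hf-checkpoint /root/Qwen3-32B \
    --save /root/Qwen3-32B_torch_dist
```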
## Preliminary Results
Starting from a Qwen3-8B-Base model SFT-ed on part of the OpenThoughts3-1.2M dataset, on-policy distillation with a Qwen3-32B teacher on the remaining data yields:
| Model | Pass@1 |
|---|---|
| Qwen3-8B-Base + SFT | 76% |
| Qwen3-8B-Base + SFT + On-Policy Distillation | 94% |