Megatron Config: Role-Based Training Overrides#
--megatron-config-path is a YAML-based configuration system for applying role-specific overrides on top of the shared Megatron CLI arguments. Today it is mainly intended for PPO actor / critic configuration.
Unlike --sglang-config, --megatron-config-path does not manage deployment, routing, or GPU orchestration. Its only job is to determine the final training arguments that each role uses.
Design Overview#
By default, when --megatron-config-path is not used, both actor and critic inherit the Megatron / slime CLI arguments directly.
With --megatron-config-path, the configuration is split into two layers:
Shared CLI arguments define the common Megatron topology, resource allocation, and default training parameters.
Role-level YAML overrides only specify the fields that should differ between actor and critic.
Key design principles:
CLI remains the shared baseline. slime first parses the normal CLI arguments, then applies the YAML role overrides.
Missing roles inherit automatically. If a role is absent from the YAML file, it simply keeps the CLI arguments unchanged.
Resource allocation is still controlled by CLI. num_nodes and num_gpus_per_node in YAML are ignored; placement is still controlled by --actor-num-* / --critic-num-*.
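The layering described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not slime's actual implementation: `resolve_role_args` and `IGNORED_KEYS` are hypothetical names, and the `config` dict stands in for an already-parsed YAML file.

```python
import argparse
import copy

# Hypothetical: resource keys this sketch skips, mirroring the rule that
# YAML cannot control placement.
IGNORED_KEYS = {"num_nodes", "num_gpus_per_node"}


def resolve_role_args(cli_args, config, role):
    """Return a copy of the CLI namespace with the YAML overrides for `role` applied."""
    role_args = copy.deepcopy(cli_args)  # the CLI stays the shared baseline
    for entry in config.get("megatron", []):
        if entry.get("role") != role:
            continue
        for key, value in entry.get("overrides", {}).items():
            if key in IGNORED_KEYS:
                continue  # placement stays under CLI control
            setattr(role_args, key, value)
    return role_args


# Shared CLI baseline (normally produced by the argument parser).
cli_args = argparse.Namespace(lr=1e-6, tensor_model_parallel_size=2)

# Stand-in for a parsed YAML file that contains only a critic entry.
config = {
    "megatron": [
        {"name": "default", "role": "critic", "overrides": {"lr": 1e-5, "num_nodes": 4}},
    ]
}

actor_args = resolve_role_args(cli_args, config, "actor")    # no entry: inherits CLI
critic_args = resolve_role_args(cli_args, config, "critic")  # lr overridden
```

In this sketch the actor keeps the CLI learning rate because it has no entry, the critic picks up its own `lr`, and the `num_nodes` key is dropped rather than applied.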
Config Format#
The config file is a YAML document whose top-level megatron key contains a list of role entries:
megatron:
  - name: default
    role: actor
    overrides:
      lr: 1e-6
      save: /path/to/actor_ckpt
  - name: default
    role: critic
    overrides:
      lr: 1e-5
      save: /path/to/critic_ckpt
Field Reference#
| Field | Type | Default | Description |
|---|---|---|---|
| name | | Optional | Label for this entry. The runtime does not depend on it today; the examples in this document use default. |
| role | | Required | Role name. Currently supported values are actor and critic. |
| overrides | | | Role-specific argument overrides applied on top of the shared CLI arguments. |
| | | | Backward-compatible alias for overrides. |
Note: Keys inside overrides use argparse attribute names, not CLI flag names. For example, use tensor_model_parallel_size rather than tensor-model-parallel-size.
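The difference between flag names and attribute names comes from standard argparse behavior: for a long option, the attribute name is the flag with the leading dashes removed and the remaining dashes turned into underscores. The short example below shows this with a toy parser; the parser is only an illustration and does not reproduce the real Megatron / slime argument list, and `flag_to_attr` is a hypothetical helper.

```python
import argparse

# Toy parser with two flags, for illustration only.
parser = argparse.ArgumentParser()
parser.add_argument("--tensor-model-parallel-size", type=int, default=1)
parser.add_argument("--lr", type=float, default=1e-6)

args = parser.parse_args(["--tensor-model-parallel-size", "2"])


def flag_to_attr(flag):
    """Convert a long CLI flag into the attribute name argparse derives by default."""
    return flag.lstrip("-").replace("-", "_")


# The key to use under `overrides` is the attribute name, not the flag.
override_key = flag_to_attr("--tensor-model-parallel-size")
value = getattr(args, override_key)
```

Here `override_key` is `tensor_model_parallel_size`, which is the spelling that belongs in the YAML file.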
Usage Pattern#
A typical PPO setup looks like this:
# megatron_ppo.yaml
megatron:
  - name: default
    role: actor
    overrides:
      lr: 1e-6
  - name: default
    role: critic
    overrides:
      lr: 1e-5
python train.py \
  --advantage-estimator ppo \
  --use-critic \
  --megatron-config-path megatron_ppo.yaml \
  --tensor-model-parallel-size 2 \
  --sequence-parallel \
  --pipeline-model-parallel-size 1 \
  --context-parallel-size 1 \
  --expert-model-parallel-size 1 \
  --expert-tensor-parallel-size 1 \
  --actor-num-nodes 1 \
  --actor-num-gpus-per-node 8 \
  --critic-num-nodes 1 \
  --critic-num-gpus-per-node 8 \
  ...
In this setup:
CLI defines the shared topology and resource layout.
YAML defines the role-specific differences, such as lr, load, save, or optimizer / scheduler parameters.
Overriding Only One Role#
You can also override only one role and let the other inherit the shared CLI configuration. For example, changing only the critic learning rate:
megatron:
  - name: default
    role: critic
    overrides:
      lr: 1e-5
In this case the actor keeps the shared CLI arguments unchanged.
Current Limitations#
PPO only for now. --megatron-config-path is currently intended for PPO actor / critic role configuration. It is not the recommended interface for GRPO, REINFORCE++, and other critic-free workflows.
Actor and critic must use the same Megatron parallel topology in current PPO. In particular, topology-related settings such as tensor_model_parallel_size, pipeline_model_parallel_size, context_parallel_size, expert_model_parallel_size, expert_tensor_parallel_size, and sequence_parallel should not differ between actor and critic.
Keep topology-related settings on CLI. The safest current pattern is to keep parallelism and resource arguments in the shared CLI configuration, and only put role-specific differences in YAML, such as lr, load, save, warmup, and optimizer / scheduler settings.
If you configure different parallel topologies for actor and critic, the behavior is currently unsupported and may fail during initialization or training.
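Because a mismatched topology may only surface at initialization or during training, it can help to scan the config before launching. The following is a hedged sketch of such a pre-flight check, not something slime provides: `TOPOLOGY_KEYS` and `find_topology_overrides` are hypothetical names, the key list is copied from the limitation above, and the input is assumed to be the already-parsed YAML document.

```python
# Topology-related settings named in the limitations above.
TOPOLOGY_KEYS = (
    "tensor_model_parallel_size",
    "pipeline_model_parallel_size",
    "context_parallel_size",
    "expert_model_parallel_size",
    "expert_tensor_parallel_size",
    "sequence_parallel",
)


def find_topology_overrides(config):
    """Return (role, key) pairs where a role entry overrides a topology-related setting."""
    found = []
    for entry in config.get("megatron", []):
        for key in entry.get("overrides", {}):
            if key in TOPOLOGY_KEYS:
                found.append((entry.get("role"), key))
    return found


# Stand-in for a parsed YAML file: the critic entry touches a topology setting.
risky_config = {
    "megatron": [
        {"name": "default", "role": "actor", "overrides": {"lr": 1e-6}},
        {
            "name": "default",
            "role": "critic",
            "overrides": {"lr": 1e-5, "tensor_model_parallel_size": 4},
        },
    ]
}

problems = find_topology_overrides(risky_config)
for role, key in problems:
    print(f"{role}: '{key}' should stay on the shared CLI, not in YAML overrides")
```

The check is deliberately conservative: it flags any topology key in YAML, even one that happens to match the CLI value, which follows the recommendation to keep those settings on the CLI.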
FAQ#
Q: Can I provide only an actor entry or only a critic entry?#
Yes. Missing roles automatically inherit the shared CLI arguments, so you do not need to duplicate everything in YAML.
Q: Can I move --actor-num-nodes or --critic-num-gpus-per-node into YAML?#
No. Resource allocation and placement groups are still controlled by CLI arguments, and the corresponding YAML fields are ignored.