# FAQ
## Why do I see garbled text during training?
This situation generally occurs because Megatron is not loaded correctly. Please check if there is a corresponding checkpoint in the directory specified by `--load` or `--ref-load`. Note that Megatron can only load a directory that contains a `latest_checkpointed_iteration.txt` file. If you need to specify a particular iteration, you can refer to the current Megatron usage instructions. Generally, you can specify the step number using `--ckpt-step`.
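As a quick sanity check, you can verify that the directory actually looks like a loadable Megatron checkpoint (the path below is illustrative):

```bash
# The directory passed to --load / --ref-load must contain
# latest_checkpointed_iteration.txt; the file records which iteration will be loaded.
CKPT_DIR=/path/to/megatron_ckpt                      # illustrative path
ls "$CKPT_DIR"                                       # expect latest_checkpointed_iteration.txt plus iter_* subdirectories
cat "$CKPT_DIR/latest_checkpointed_iteration.txt"    # prints the iteration to be loaded, e.g. 2000
```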
## Why is my task stuck on the Ray submission page?
Please check whether your task is set up for co-located training and inference or decoupled training and inference.
If it’s co-located (training and inference share the same GPUs), please check:

- Whether the `--colocate` parameter is set to enable co-located mode.
- Whether the total number of GPUs for the current task is greater than or equal to `actor_num_nodes * actor_num_gpus_per_node`.
If it’s decoupled, please check:

- Whether the total number of GPUs for the current task is greater than or equal to `actor_num_nodes * actor_num_gpus_per_node + rollout_num_gpus` (see the worked example after this list).
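A minimal sketch of the GPU arithmetic, using illustrative numbers:

```bash
# Illustrative values, not defaults.
actor_num_nodes=1
actor_num_gpus_per_node=4
rollout_num_gpus=4

# Co-located (--colocate set): training and rollout share the actor GPUs.
echo $(( actor_num_nodes * actor_num_gpus_per_node ))                      # task needs >= 4 GPUs
# Decoupled: rollout GPUs are allocated in addition to the actor GPUs.
echo $(( actor_num_nodes * actor_num_gpus_per_node + rollout_num_gpus ))   # task needs >= 8 GPUs
```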
## Why did I encounter an Out-of-Memory (OOM) error during training? What is max_tokens_per_gpu for?

OOM errors often happen because `max_tokens_per_gpu` is set too high. This parameter defines the maximum number of tokens that can be processed on each GPU during training. If you are concerned about OOM, you can initially set this value to `rollout_max_response_len / cp_size` and then increase it later to improve training efficiency. Note that `--max-tokens-per-gpu` is only active when `--use-dynamic-batch-size` is enabled.
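For example, a conservative starting point might look like this (numbers are illustrative):

```bash
# max_tokens_per_gpu only takes effect together with --use-dynamic-batch-size.
rollout_max_response_len=8192     # illustrative
cp_size=2                         # illustrative (--context-parallel-size)
echo $(( rollout_max_response_len / cp_size ))   # 4096 -> a safe initial value for --max-tokens-per-gpu
```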
If you still experience OOM with a small `max_tokens_per_gpu`, check if the data generated in a single pass is too long. You may need to enable context parallelism (CP) with `--context-parallel-size`. If you are using custom data generation, check if the total length of multi-turn generations is much longer than expected.

## During multi-node training, what should I do if the `transformers` library reports it cannot find a model?

This usually happens when multiple processes try to read local files simultaneously using methods like `AutoConfig.from_pretrained` or `AutoModelForCausalLM.from_pretrained`, causing file system write conflicts. You can mitigate this issue by setting the `--model-name` argument.
## How do I resume training?
Simply set the `--load` directory to your `--save` directory.
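For example (the path is illustrative; keep the rest of your launch arguments unchanged):

```bash
# Both flags point at the same directory, so the next run resumes from the
# latest checkpoint saved there.
CKPT_DIR=/checkpoints/my_run                   # illustrative path
echo "--save ${CKPT_DIR} --load ${CKPT_DIR}"   # add these to your launch command
```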
## How is the batch size calculated?
A single rollout uses `rollout_batch_size` prompts. For each prompt, `n_samples_per_prompt` samples are generated. Therefore, one rollout contains a total of `rollout_batch_size * n_samples_per_prompt` data entries.

You can use `--num-steps-per-rollout` to determine how many steps to run per rollout. This is equivalent to setting the `global_batch_size` to `rollout_batch_size * n_samples_per_prompt // num_steps_per_rollout`.
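A worked example with illustrative numbers:

```bash
rollout_batch_size=32        # prompts per rollout (illustrative)
n_samples_per_prompt=8       # samples per prompt (illustrative)
num_steps_per_rollout=4      # --num-steps-per-rollout (illustrative)

echo $(( rollout_batch_size * n_samples_per_prompt ))                           # 256 data entries per rollout
echo $(( rollout_batch_size * n_samples_per_prompt / num_steps_per_rollout ))   # effective global_batch_size = 64
```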
## Does slime perform data packing / variable-length (varlen) processing?
Yes. Data packing refers to the process of concatenating samples of varying lengths during training to improve GPU utilization. slime performs this operation by default.
## What should I do if the sglang component shows a `Max retries exceeded with url: /get_model_info (Caused by NewConnectionError)` error?

This issue primarily stems from port conflicts caused by multiple sglang servers running on a single machine. We are currently working with the sglang team to resolve this. A temporary workaround is to minimize the number of sglang servers on a single machine, for example, by setting `tp=8`.
## My gradient norm is very high and the training crashes. What should I do?
First, ensure that your data and model are compatible. For example, if your data already uses a chat template, check if this template matches the one used by the original model. If the data is correct, please refer to our Debug Guide for a more in-depth analysis.
## My sglang generation takes an extremely long time, GPU power is maxed out, and there’s no output for a long while. Why?
Please verify that the model corresponding to `--hf-checkpoint` has its stop tokens configured correctly. If not, you can set them using the `--rollout-stop` or `--rollout-stop-token-ids` arguments.
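If you are unsure which stop token the model expects, one way to inspect the checkpoint is sketched below (the path is illustrative; for chat models the end-of-turn token may differ from `eos_token`, so also check the chat template in `tokenizer_config.json`):

```bash
# Print the tokenizer's EOS token and id from the checkpoint that
# --hf-checkpoint points at, which you can then pass via --rollout-stop
# or --rollout-stop-token-ids.
HF_CKPT=/path/to/hf_checkpoint    # illustrative path
python -c "
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained('$HF_CKPT')
print(tok.eos_token, tok.eos_token_id)
"
```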
## Sglang shows an `an illegal memory access was encountered` error.

According to the sglang documentation (https://docs.sglang.ai/references/troubleshooting.html), this could be an OOM error. Consider reducing the value of `--sglang-mem-fraction-static`.
## A `JSONDecodeError` occurs related to torch compile/inductor.

This is generally an issue with the torch compiler’s cache read/write operations. You can try adding
"TORCHINDUCTOR_FORCE_DISABLE_CACHES": "1"
to theenv_vars
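A minimal sketch, assuming you submit the job with `ray job submit` (the address and entrypoint are illustrative; if you pass `env_vars` to Ray some other way, add the same key there):

```bash
ray job submit --address http://127.0.0.1:8265 \
  --runtime-env-json '{"env_vars": {"TORCHINDUCTOR_FORCE_DISABLE_CACHES": "1"}}' \
  -- python train.py  # ...your existing training arguments...
```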
## Gradient becomes NaN or Inf during training.
You can try setting the `--no-check-for-nan-in-loss-and-grad` flag to skip the corresponding training steps.