Fault Tolerance
To ensure long-term, stable RL training, slime enables a certain level of fault tolerance by default. This section introduces the design philosophy behind fault tolerance in slime.
To enable fault tolerance in slime, set --use-fault-tolerance.
Rollout Fault Tolerance
During the rollout process, slime periodically sends heartbeat requests (/health_generate) to all SGLang servers. If a heartbeat times out, that SGLang server is stopped. After the current rollout round completes, the server is restarted and its parameters are updated correctly.
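The stop-then-recover flow described above can be sketched as follows. This is a minimal illustration, not slime's implementation: the function names `check_servers` and `recover_after_round`, and the `heartbeat_ok`, `restart`, and `update_params` callables, are hypothetical stand-ins for whatever slime actually does to probe, restart, and re-sync a server.

```python
from typing import Callable, Iterable, List, Set


def check_servers(
    servers: Iterable[str],
    stopped: Set[str],
    heartbeat_ok: Callable[[str], bool],
) -> None:
    """Run one heartbeat pass and mark servers whose heartbeat failed as stopped."""
    for server in servers:
        if server in stopped:
            continue  # already stopped; it stays down until the round ends
        if not heartbeat_ok(server):  # False models a heartbeat timeout
            stopped.add(server)


def recover_after_round(
    stopped: Set[str],
    restart: Callable[[str], None],
    update_params: Callable[[str], None],
) -> List[str]:
    """After the rollout round, restart each stopped server, then update its parameters."""
    recovered = []
    for server in sorted(stopped):
        restart(server)
        update_params(server)
        recovered.append(server)
    stopped.clear()
    return recovered


# Usage with stand-in callables: "s2" fails its heartbeat during the round.
servers = ["s1", "s2", "s3"]
stopped: Set[str] = set()
events: List[str] = []
check_servers(servers, stopped, heartbeat_ok=lambda s: s != "s2")
recovered = recover_after_round(
    stopped,
    restart=lambda s: events.append(f"restart {s}"),
    update_params=lambda s: events.append(f"update {s}"),
)
```

The point of deferring recovery to the end of the round is that the remaining healthy servers can finish the current rollout without waiting on a restart.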
--rollout-health-check-first-wait: Some large MoE models require compilation on their first run, so before the first rollout slime waits rollout_health_check_first_wait seconds before it starts sending heartbeats. Defaults to 300s.
--rollout-health-check-interval: The interval between heartbeat checks. Defaults to 10s.
--rollout-health-check-timeout: The timeout limit for a heartbeat request. Defaults to 5s.
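To make the interaction of the three settings concrete, here is a rough sketch of the heartbeat timing and a single probe. It rests on several assumptions: `HealthCheckConfig`, `heartbeat_offsets`, and `probe_health` are hypothetical names, the interval is taken as send-to-send, /health_generate is assumed to answer a plain GET, and an HTTP 200 is assumed to mean healthy. Only the default values and the endpoint path come from the text above.

```python
import urllib.request
from dataclasses import dataclass
from typing import List


@dataclass
class HealthCheckConfig:
    # Defaults mirror the documented values; the field names are illustrative.
    first_wait: float = 300.0  # --rollout-health-check-first-wait
    interval: float = 10.0     # --rollout-health-check-interval
    timeout: float = 5.0       # --rollout-health-check-timeout


def heartbeat_offsets(cfg: HealthCheckConfig, horizon: float) -> List[float]:
    """Offsets in seconds (from startup) at which heartbeats are sent, up to horizon."""
    offsets = []
    t = cfg.first_wait  # no heartbeats during the initial grace period
    while t <= horizon:
        offsets.append(t)
        t += cfg.interval
    return offsets


def probe_health(base_url: str, cfg: HealthCheckConfig) -> bool:
    """Send one heartbeat; a timeout or connection error counts as a failure."""
    try:
        with urllib.request.urlopen(
            f"{base_url}/health_generate", timeout=cfg.timeout
        ) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False
```

Under these assumptions and the default values, `heartbeat_offsets(HealthCheckConfig(), 330)` yields heartbeats at 300, 310, 320, and 330 seconds, and each one is given up to 5 seconds to respond.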