# AMD
⚠️ If you encounter problems on AMD Instinct, feel free to reach out to Yusheng Su.
## Introduction
If you are running slime on AMD Instinct GPUs, please refer to the following materials. This tutorial explains how to set up the development environment (Docker), use the modified ROCm dependencies, and run an example experiment. The current ROCm Docker image only supports AMD's MI300 and MI325 GPUs.
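To confirm your hardware is supported before pulling the image, you can check the device name and GFX target on the host; the MI300-series parts report gfx942 (a quick sanity check, assuming the ROCm tools are installed on the host):

```bash
# List GPU product names (MI300X / MI325X expected)
rocm-smi --showproductname

# Alternatively, check the GFX target; MI300-series devices report gfx942
rocminfo | grep -o 'gfx94[0-9]' | sort -u
```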
## Docker
You can download the prebuilt image from Docker Hub: `rlsys/slime`.

```bash
docker pull rlsys/slime:latest
```
Or you can build it yourself with `Dockerfile.rocm`:

```bash
cd docker
docker build -f Dockerfile.rocm -t rlsys/slime:latest .
```
Acknowledgement: Thanks to Yang Wang for the patch to this ROCm base Docker image that adds virtual memory management support on MI300X.
## Quick Start
### Environment Setup
Based on the `rlsys/slime` image (pre-installed with SGLang and Megatron-LM):

```bash
docker run --rm -it \
--device /dev/dri \
--device /dev/kfd \
-p 8265:8265 \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v $HOME/.ssh:/root/.ssh \
-v $HOME:$HOME \
--shm-size 128G \
--name slime_dev \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-w $PWD \
rlsys/slime:latest \
/bin/bash
```
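Once inside the container, it is worth confirming that the GPUs are visible both to ROCm and to the PyTorch build (the ROCm build of PyTorch exposes HIP devices through the `torch.cuda` API):

```bash
# All expected GPUs should be listed
rocm-smi

# PyTorch's ROCm build reports HIP devices through the CUDA API
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```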
Then, download and install slime:

```bash
git clone https://github.com/THUDM/slime.git
cd slime
pip install -e . --no-deps
```
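To confirm the editable install is visible to Python (this is the same check behind the "slime cannot be found" warning later on this page):

```bash
python3 -c "import slime; print(slime.__file__)"
```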
Download the model and data:

```bash
# hf checkpoint
hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B

# train data
hf download --repo-type dataset zhuzilin/dapo-math-17k \
  --local-dir /root/dapo-math-17k

# eval data
hf download --repo-type dataset zhuzilin/aime-2024 \
  --local-dir /root/aime-2024
```
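After the downloads finish, the paths referenced by the scripts below should look roughly like this (file names as used in the training script):

```bash
ls /root/Qwen3-4B                        # config.json, *.safetensors, tokenizer files
ls /root/dapo-math-17k /root/aime-2024   # dapo-math-17k.jsonl, aime-2024.jsonl
```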
### Checkpoint Format Conversion
Since slime uses Megatron, and Megatron does not support loading Hugging Face checkpoints directly, we need to convert the model to the `torch_dist` format that Megatron supports.
#### HF → Megatron torch_dist ckpt
Use mbridge or Megatron-LM-amd_version for conversion:

```bash
cd slime/
source scripts/models/qwen3-4B.sh
MEGATRON_LM_PATH=$(pip list | grep megatron-core | awk '{print $NF}')
PYTHONPATH=${MEGATRON_LM_PATH} python tools/convert_hf_to_torch_dist.py \
  ${MODEL_ARGS[@]} \
  --no-gradient-accumulation-fusion \
  --hf-checkpoint /root/Qwen3-4B \
  --save /root/Qwen3-4B_torch_dist
```
Note: We implemented a dedicated AMD conversion script that forces a CPU-only conversion workflow using the Gloo backend to bypass hardware-specific issues. A GPU-based script for ROCm is currently in development.
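For context, "CPU-only with the Gloo backend" means the converter initializes `torch.distributed` without touching HIP at all; every tensor stays on the CPU. A minimal sketch of that setup (illustrative only, not slime's actual converter code):

```bash
# Illustrative sketch: a single-process Gloo process group that needs no GPU
python3 - <<'EOF'
import os
import torch.distributed as dist

# Gloo rendezvous still needs an address/port, even for one process
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="gloo", rank=0, world_size=1)
print("backend:", dist.get_backend())  # -> gloo; all work stays on the CPU
dist.destroy_process_group()
EOF
```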
⚠️ If you encounter an issue where slime cannot be found, please run `pip install -e . --no-deps` in the slime directory.
## Example: Qwen3-4B
We provide an example for the Qwen3-4B model. Just run:

```bash
SLIME_DIR=/root \
MODEL_DIR=/root \
DATA_DIR=/root \
bash scripts/run-qwen3-4B-amd.sh
```
⚠️ TODO: ROCm does not seem to support apex yet, so for now we disable gradient accumulation fusion by adding the `--no-gradient-accumulation-fusion` flag to the training script. We will continue investigating how to enable it.
⚠️ Note: The main difference between the ROCm training script and the NVIDIA one is that you need to set `RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES` and `HIP_VISIBLE_DEVICES` for Ray to function properly on AMD GPUs.
We show the training script below:

```bash
#!/bin/bash
# clean up any leftovers from a previous run of the task
pkill -9 sglang
sleep 3
ray stop --force
pkill -9 ray
pkill -9 python
sleep 3
pkill -9 ray
pkill -9 python
set -euxo pipefail
### AMD Support ###
SLIME_DIR="${SLIME_DIR:-/home/yushensu/projects/slime}" # Default path if not set in environment
export SLIME_DIR
MODEL_DIR="${MODEL_DIR:-/home/yushensu/projects/model}" # Default path if not set in environment
export MODEL_DIR
DATA_DIR="${DATA_DIR:-/home/yushensu/projects/data}" # Default path if not set in environment
export DATA_DIR
# For AMD GPU
export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES=${RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES:-"1"} # Must set to 1
export HIP_VISIBLE_DEVICES=${HIP_VISIBLE_DEVICES:-"0,1,2,3,4,5,6,7"} # choose which GPUs to use
####################
# prevent Python/Ray from buffering stdout/stderr
export PYTHONUNBUFFERED=1
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
source "${SCRIPT_DIR}/models/qwen3-4B.sh"
CKPT_ARGS=(
--hf-checkpoint ${MODEL_DIR}/Qwen3-4B
--ref-load ${MODEL_DIR}/Qwen3-4B_torch_dist
--load ${MODEL_DIR}/Qwen3-4B_slime/
--save ${MODEL_DIR}/Qwen3-4B_slime/
--save-interval 20
)
ROLLOUT_ARGS=(
--prompt-data ${DATA_DIR}/dapo-math-17k/dapo-math-17k.jsonl
--input-key prompt
--label-key label
--apply-chat-template
--rollout-shuffle
--rm-type deepscaler
--num-rollout 3000
--rollout-batch-size 32
--n-samples-per-prompt 8
--rollout-max-response-len 8192
--rollout-temperature 1
--global-batch-size 256
--balance-data
)
EVAL_ARGS=(
--eval-interval 20
--eval-prompt-data aime ${DATA_DIR}/aime-2024/aime-2024.jsonl
--n-samples-per-eval-prompt 16
--eval-max-response-len 16384
--eval-top-p 1
)
PERF_ARGS=(
--tensor-model-parallel-size 2
--sequence-parallel
--pipeline-model-parallel-size 1
--context-parallel-size 1
--expert-model-parallel-size 1
--expert-tensor-parallel-size 1
--recompute-granularity full
--recompute-method uniform
--recompute-num-layers 1
# --micro-batch-size 1
--use-dynamic-batch-size
--max-tokens-per-gpu 9216
)
GRPO_ARGS=(
--advantage-estimator grpo
--use-kl-loss
--kl-loss-coef 0.00
--kl-loss-type low_var_kl
--entropy-coef 0.00
--eps-clip 0.2
--eps-clip-high 0.28
)
OPTIMIZER_ARGS=(
--optimizer adam
--lr 1e-6
--lr-decay-style constant
--weight-decay 0.1
--adam-beta1 0.9
--adam-beta2 0.98
)
WANDB_ARGS=(
# --use-wandb
# --wandb-project slime-dev
# --wandb-group qwen3-4B-test
# --wandb-key ${WANDB_KEY}
)
SGLANG_ARGS=(
--rollout-num-gpus-per-engine 2
--sglang-mem-fraction-static 0.7
)
MISC_ARGS=(
# default dropout in megatron is 0.1
--attention-dropout 0.0
--hidden-dropout 0.0
# should be good for model performance
--accumulate-allreduce-grads-in-fp32
--attention-softmax-in-fp32
# comment this out when using a model with MLA
--attention-backend flash
### AMD Support ###
# disable gradient accumulation fusion: apex is required to enable it
--no-gradient-accumulation-fusion
###################
)
# launch the master node of ray in container
export MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
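# derive Ray's GPU count from the HIP_VISIBLE_DEVICES list set above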
NUM_GPUS=$(echo ${HIP_VISIBLE_DEVICES} | tr ',' '\n' | wc -l)
ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus ${NUM_GPUS} --disable-usage-stats --dashboard-host=0.0.0.0 --dashboard-port=8265
# "PYTHONPATH": "/workspace/Megatron-LM/",
MEGATRON_LM_PATH=$(pip list | grep megatron-core | awk '{print $NF}')
ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json='{
"env_vars": {
"PYTHONPATH": "/workspace/Megatron-LM/",
"CUDA_DEVICE_MAX_CONNECTIONS": "1"
}
}' \
-- python3 train.py \
--actor-num-nodes 1 \
--actor-num-gpus-per-node 8 \
--colocate \
${MODEL_ARGS[@]} \
${CKPT_ARGS[@]} \
${ROLLOUT_ARGS[@]} \
${OPTIMIZER_ARGS[@]} \
${GRPO_ARGS[@]} \
${WANDB_ARGS[@]} \
${PERF_ARGS[@]} \
${EVAL_ARGS[@]} \
${SGLANG_ARGS[@]} \
${MISC_ARGS[@]}
#### clean up after training
pkill -9 sglang
sleep 3
ray stop --force
pkill -9 ray
pkill -9 python
sleep 3
pkill -9 ray
pkill -9 python
```
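While the job runs, you can follow it through the Ray dashboard on port 8265 (forwarded by `-p 8265:8265` in the `docker run` command above) or from the CLI; `<submission_id>` below is whatever ID `ray job list` prints:

```bash
# List submitted jobs, then tail the logs of the one you want
ray job list --address http://127.0.0.1:8265
ray job logs --follow --address http://127.0.0.1:8265 <submission_id>
```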