GLM-5.2 744B-A40B with 256xH100
This is the recommended 32-node, 256-H100 training example for GLM-5.2.
The recipe uses the GLM-5.2 BF16 checkpoint for Megatron training and the FP8 checkpoint for SGLang rollout. It assumes two Hugging Face repositories will be available:
BF16: zai-org/GLM-5.2
FP8: zai-org/GLM-5.2-FP8
Environment Setup
For environment setup and dataset download, see Example: Qwen3-4B. For multi-node training, make sure every node can access the same $BASE_DIR path.
Download Model
hf download zai-org/GLM-5.2 --local-dir $BASE_DIR/GLM-5.2
hf download zai-org/GLM-5.2-FP8 --local-dir $BASE_DIR/GLM-5.2-FP8
The open-source GLM-5.2 config uses model_type: glm_moe_dsa, which slime maps onto
the DeepSeek-V3.2 bridge (slime_plugins.mbridge.deepseek_v32) since the two share the
same DSA weight layout.
Convert Checkpoint
The training side needs the BF16 Hugging Face checkpoint converted to the Megatron torch_dist format. Because torch_dist is reshardable, the parallel layout used for conversion does not need to match the training layout; it only needs to satisfy Megatron’s expert-group constraint for the number of GPUs used in the conversion.
Run the following on 4 nodes / 32 GPUs:
cd /root/slime
pip install -e . --no-deps
source scripts/models/glm5.2-744B-A40B.sh
PYTHONPATH=/root/Megatron-LM/ torchrun \
    --nproc-per-node 8 \
    --master-addr ${MASTER_ADDR} --master-port 12345 \
    --nnodes=4 --node-rank ${NODE_RANK} \
    tools/convert_hf_to_torch_dist.py \
    ${MODEL_ARGS[@]} \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 2 \
    --decoder-last-pipeline-num-layers 40 \
    --expert-model-parallel-size 16 \
    --expert-tensor-parallel-size 1 \
    --hf-checkpoint $BASE_DIR/GLM-5.2/ \
    --save $BASE_DIR/GLM-5.2_torch_dist/
Here, MASTER_ADDR is the IP of node0, and NODE_RANK is the current node index.
MODEL_ARGS includes --allgather-cp, a slime-only flag, so tools/convert_hf_to_torch_dist.py registers it as well (it has no effect during conversion). On 32 GPUs, Megatron requires expert_tp(1) * expert_model_parallel * pp to divide the world size. The training value EP=32 would give 1*32*2=64, which does not divide 32, so we convert with EP=16 (1*16*2=32). The resulting checkpoint still loads at training-time EP=32 because torch_dist is reshardable.
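The constraint above can be explored with a small sketch. It lists the EP sizes that satisfy expert_tp * EP * PP dividing the world size, under the additional assumption that EP must also split the 256 routed experts evenly. The function name is hypothetical and is not part of slime or Megatron.

```python
def valid_ep_sizes(world_size, pipeline_parallel, expert_tensor_parallel, num_experts):
    """List EP sizes that satisfy the expert-group constraint quoted in the text.

    Assumption: EP must divide the number of routed experts, and
    expert_tp * EP * PP must divide the world size.
    """
    sizes = []
    for ep in range(1, num_experts + 1):
        if num_experts % ep != 0:
            continue  # experts would not split evenly across EP ranks
        group = expert_tensor_parallel * ep * pipeline_parallel
        if world_size % group == 0:
            sizes.append(ep)
    return sizes


# Conversion layout from this page: 4 nodes x 8 GPUs, PP=2, expert TP=1.
conversion = valid_ep_sizes(world_size=32, pipeline_parallel=2,
                            expert_tensor_parallel=1, num_experts=256)
print(conversion)  # [1, 2, 4, 8, 16]
```

Under these assumptions EP=16 is the largest value that fits on 32 GPUs with PP=2, which matches the conversion command.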
Run Training
From node0:
cd /root/slime
export BASE_DIR=/shared/path
export MASTER_ADDR=<node0-ip>
export HOSTFILE=$BASE_DIR/hostfile # one worker IP per line, all 32 nodes
bash scripts/run-glm5.2-744B-A40B.sh
If HOSTFILE is not set, join the other nodes to the Ray cluster manually.
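Before launching, it can help to confirm that the hostfile lists all 32 nodes exactly once. The sketch below is a standalone check, not the parser used by the launch script. Skipping blank lines and stripping `#` comments are assumptions made for illustration.

```python
def read_hosts(hostfile_text):
    """Return the non-empty entries of a hostfile, one host per line."""
    hosts = []
    for line in hostfile_text.splitlines():
        entry = line.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if entry:
            hosts.append(entry)
    return hosts


def check_hosts(hosts, expected_nodes=32):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if len(hosts) != expected_nodes:
        problems.append(f"expected {expected_nodes} hosts, found {len(hosts)}")
    duplicates = sorted({h for h in hosts if hosts.count(h) > 1})
    if duplicates:
        problems.append("duplicate entries: " + ", ".join(duplicates))
    return problems


# Hypothetical hostfile content with 32 placeholder addresses.
sample = "\n".join(f"10.0.0.{i}" for i in range(1, 33))
print(check_hosts(read_hosts(sample)))  # []
```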
Parameter Introduction
Model Configuration
scripts/models/glm5.2-744B-A40B.sh contains the GLM-5.2 DSA + cross-layer index sharing configuration: 256 routed experts, top-8 activation, 1 shared expert, and 78 layers total (3 dense + 75 MoE).
The DSA index sharing schedule, such as index_topk_freq=4 and index_skip_topk_offset=3, is read from the Hugging Face config. The Megatron side uses the shared slime_plugins.models.glm5.glm5:get_glm5_spec provider and enables:
--allgather-cp
This makes DSA + context parallel use the allgather-CP layout, and the index-share provider gathers index K/V across the CP group.
Training Parallelism
The default script targets 32 nodes and 256 GPUs:
PERF_ARGS=(
--tensor-model-parallel-size 4
--pipeline-model-parallel-size 8
--decoder-first-pipeline-num-layers 14
--decoder-last-pipeline-num-layers 16
--context-parallel-size 8
--expert-model-parallel-size 32
--expert-tensor-parallel-size 1
...
)
TP(4) * PP(8) * CP(8) = 256, so all 256 GPUs form a single training group (DP=1). The expert-group product expert_tp(1) * EP(32) * PP(8) = 256 also equals the world size (expert_dp=1).
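The same arithmetic can be written as a short sketch that derives both data-parallel sizes from the layout. The function name and the returned dictionary are hypothetical. The two products are the ones stated in the text.

```python
def parallel_group_sizes(world_size, tp, pp, cp, ep, expert_tp):
    """Derive dense and expert data-parallel sizes from a parallel layout."""
    dense_group = tp * pp * cp          # GPUs holding one dense model replica
    expert_group = expert_tp * ep * pp  # GPUs holding one expert replica
    if world_size % dense_group != 0:
        raise ValueError(f"TP*PP*CP={dense_group} does not divide {world_size}")
    if world_size % expert_group != 0:
        raise ValueError(f"expert_tp*EP*PP={expert_group} does not divide {world_size}")
    return {"dp": world_size // dense_group,
            "expert_dp": world_size // expert_group}


# Training layout from PERF_ARGS on 32 nodes x 8 GPUs.
training = parallel_group_sizes(world_size=256, tp=4, pp=8, cp=8, ep=32, expert_tp=1)
print(training)  # {'dp': 1, 'expert_dp': 1}
```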
DSA cross-layer index sharing requires every pipeline stage to start on a “computing” layer. With index_topk_freq=4 / index_skip_topk_offset=3, the computing layers are 1, 2, 3, 7, 11, …, 75. A uniform 78/8 split would start stages on skip layers and fail the index-share assertion in get_glm5_spec. We therefore use --decoder-first-pipeline-num-layers 14 and --decoder-last-pipeline-num-layers 16, leaving 6 middle stages of (78-14-16)/6 = 8 layers each. The stages then start on global layers 1, 15, 23, 31, 39, 47, 55, 63, all of which are computing layers.
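The stage-start argument can be checked with a minimal sketch. It assumes a rule inferred from the layer list in the text: layers up to index_skip_topk_offset are computing layers, and after that every index_topk_freq-th layer is. Both helpers are hypothetical and do not reproduce the assertion inside get_glm5_spec.

```python
def stage_start_layers(num_layers, pp, first=None, last=None):
    """Return the 1-based global layer at which each pipeline stage starts."""
    sizes = [None] * pp
    if first is not None:
        sizes[0] = first
    if last is not None:
        sizes[-1] = last
    free = [i for i, size in enumerate(sizes) if size is None]
    remaining = num_layers - sum(size for size in sizes if size is not None)
    if free:
        if remaining % len(free) != 0:
            raise ValueError("remaining layers do not split evenly")
        for i in free:
            sizes[i] = remaining // len(free)
    elif remaining != 0:
        raise ValueError("stage sizes do not add up to num_layers")
    starts, layer = [], 1
    for size in sizes:
        starts.append(layer)
        layer += size
    return starts


def is_computing_layer(layer, freq=4, offset=3):
    """Assumed rule: layers 1..offset, then offset + k*freq, compute the index."""
    return layer <= offset or (layer - offset) % freq == 0


starts = stage_start_layers(78, pp=8, first=14, last=16)
print(starts)  # [1, 15, 23, 31, 39, 47, 55, 63]
print(all(is_computing_layer(s) for s in starts))  # True
```

The same check passes for the conversion layout (PP=2 with a 40-layer last stage), where the second stage starts on layer 39.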
BF16 Training + FP8 Rollout
The launcher writes the default paths directly in CKPT_ARGS and ROLLOUT_ARGS, matching the style of the other example scripts:
CKPT_ARGS=(
--hf-checkpoint $BASE_DIR/GLM-5.2-FP8
--ref-load $BASE_DIR/GLM-5.2_torch_dist
--load $BASE_DIR/GLM-5.2_slime
--save $BASE_DIR/GLM-5.2_slime
--save-interval 20
)
ROLLOUT_ARGS=(
--prompt-data $BASE_DIR/dapo-math-17k/dapo-math-17k.jsonl
...
)
--hf-checkpoint provides FP8 weights and the tokenizer for SGLang rollout; --ref-load is the Megatron torch_dist checkpoint converted from BF16. To debug BF16 rollout, change --hf-checkpoint in the script to $BASE_DIR/GLM-5.2.
SGLang Configuration
The rollout side runs with prefill/decode (PD) disaggregation: 1 prefill engine (64 GPUs) + 3 decode engines (192 GPUs) = 256 GPUs in total, which must equal the colocated rollout_num_gpus. Each engine spans 64 GPUs with DP attention and EP=64; DeepEP’s dispatch config map supports up to 160 EP ranks, so a single 256-GPU engine would be invalid. Prefill uses the auto DeepEP path; decode uses low_latency + deep_gemm. The split is configured via the --sglang-config YAML:
sglang:
  - name: default
    server_groups:
      - worker_type: prefill
        num_gpus: 64
        num_gpus_per_engine: 64
        overrides: { deepep_mode: auto, ... }
      - worker_type: decode
        num_gpus: 192
        num_gpus_per_engine: 64
        overrides: { deepep_mode: low_latency, moe_runner_backend: deep_gemm, ... }
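The GPU accounting behind this YAML can be sketched as a small consistency check. The dictionaries mirror only the fields shown above, and the helper is hypothetical. It assumes each engine's EP size equals num_gpus_per_engine, as described in the text.

```python
MAX_DEEPEP_RANKS = 160  # EP-rank limit quoted in the text

server_groups = [
    {"worker_type": "prefill", "num_gpus": 64, "num_gpus_per_engine": 64},
    {"worker_type": "decode", "num_gpus": 192, "num_gpus_per_engine": 64},
]


def count_engines(groups, rollout_num_gpus, ep_limit=MAX_DEEPEP_RANKS):
    """Return engines per worker type after checking the GPU budget."""
    engines = {}
    total = 0
    for group in groups:
        per_engine = group["num_gpus_per_engine"]
        if group["num_gpus"] % per_engine != 0:
            raise ValueError(f"{group['worker_type']}: GPUs do not fill whole engines")
        if per_engine > ep_limit:
            raise ValueError(f"{group['worker_type']}: EP={per_engine} exceeds {ep_limit}")
        engines[group["worker_type"]] = group["num_gpus"] // per_engine
        total += group["num_gpus"]
    if total != rollout_num_gpus:
        raise ValueError(f"groups use {total} GPUs, expected {rollout_num_gpus}")
    return engines


print(count_engines(server_groups, rollout_num_gpus=256))  # {'prefill': 1, 'decode': 3}
```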
PD transfer runs over RDMA/IB with the mooncake backend:
--sglang-disaggregation-transfer-backend mooncake
--sglang-disaggregation-ib-device mlx5_100,...,mlx5_107
The rest of the rollout uses FP8 KV cache and the NSA + DeepEP backends:
SGLANG_ARGS=(
--sglang-enable-dp-attention
--sglang-ep-size 64
--sglang-dp-size 64
--sglang-kv-cache-dtype fp8_e4m3
--sglang-nsa-decode-backend flashmla_kv
--sglang-nsa-prefill-backend flashmla_sparse
--sglang-attention-backend nsa
...
)
MTP / EAGLE speculative decoding is enabled using the model’s own next-token-prediction layer (the GLM-5.2 checkpoint ships an MTP layer), so no separate draft model is needed:
--sglang-speculative-algorithm EAGLE
--sglang-speculative-num-steps 4
--sglang-speculative-eagle-topk 1
--sglang-speculative-num-draft-tokens 5
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK must cover the largest decode batch: the largest cuda_graph_max_bs (12, set on the decode group) * speculative_num_draft_tokens (5) = 60, rounded up to 64. A smaller value trips the DeepEP low-latency dispatch buffer assertion during the decode group’s CUDA-graph capture.
Networking
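The sizing rule above can be worked through as a short calculation. Rounding up to the next power of two is an assumption chosen because it reproduces the value 64. The text only says the result is rounded up to 64, and the helper name is hypothetical.

```python
def min_dispatch_tokens_per_rank(max_cuda_graph_bs, num_draft_tokens):
    """Return (tokens needed, rounded value) for the DeepEP dispatch buffer."""
    needed = max_cuda_graph_bs * num_draft_tokens
    rounded = 1
    while rounded < needed:  # assumption: round up to a power of two
        rounded *= 2
    return needed, rounded


needed, rounded = min_dispatch_tokens_per_rank(max_cuda_graph_bs=12, num_draft_tokens=5)
print(needed, rounded)  # 60 64
```

If cuda_graph_max_bs or the number of draft tokens is raised, the environment variable would need to grow accordingly.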
DeepEP/NVSHMEM communication across nodes needs the IB-aware NCCL settings in the Ray runtime env (NCCL_SOCKET_IFNAME, NCCL_IB_*, NCCL_NET_GDR_LEVEL, NCCL_P2P_LEVEL=NVL, NCCL_NVLS_ENABLE=0, MC_IB_PCI_RELAXED_ORDERING, …). The script defaults to SOCKET_IFNAME=eth0; set SOCKET_IFNAME before launch if your environment differs, and it will be written to GLOO_SOCKET_IFNAME, TP_SOCKET_IFNAME, and NCCL_SOCKET_IFNAME. DeepEP also requires NVSHMEM_DISABLE_NCCL=1.
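As an illustration of how the interface name fans out, the sketch below builds a Ray runtime_env dictionary containing only the settings whose values are stated above. The helper name is hypothetical, the remaining NCCL_IB_* and related variables are omitted because their values are not given here, and the launch script may assemble its environment differently.

```python
import os


def build_runtime_env(socket_ifname=None):
    """Build a Ray runtime_env dict with the network settings named in the text."""
    ifname = socket_ifname or os.environ.get("SOCKET_IFNAME", "eth0")
    env_vars = {
        # One interface name is written to all three socket settings.
        "GLOO_SOCKET_IFNAME": ifname,
        "TP_SOCKET_IFNAME": ifname,
        "NCCL_SOCKET_IFNAME": ifname,
        # Fixed values stated in the text.
        "NCCL_P2P_LEVEL": "NVL",
        "NCCL_NVLS_ENABLE": "0",
        "NVSHMEM_DISABLE_NCCL": "1",
    }
    return {"env_vars": env_vars}


runtime_env = build_runtime_env("bond0")  # "bond0" is a placeholder interface name
print(runtime_env["env_vars"]["NCCL_SOCKET_IFNAME"])  # bond0
```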