Observability#

slime’s default observability path is intentionally small: training metrics still go to W&B / TensorBoard; high-frequency SGLang Prometheus metrics are no longer uploaded to W&B; request timings from SGLang response meta_info are stored in sample traces and aggregated once per rollout step as compact perf/... metrics.

W&B / TensorBoard Metrics#

W&B and TensorBoard still receive reward, loss, KL, entropy, eval, and other training metrics. SGLang request timing summaries are logged under perf/, for example:

perf/request/e2e_latency/mean
perf/request/queue_time/median
perf/request/count
perf/request/profiled_count
perf/decode/throughput/mean
perf/prefill/bootstrap_queue_duration/mean
perf/prefill/bootstrap_duration/mean
perf/prefill/alloc_wait_duration/mean
perf/prefill/forward_duration/max
perf/prefill/transfer_speed_gb_s/mean
perf/decode/prealloc_duration/mean
perf/decode/bootstrap_duration/mean
perf/decode/alloc_wait_duration/mean
perf/decode/transfer_duration/max
perf/decode/forward_duration/mean

These metrics are aggregated once per rollout step, not emitted once per request, so they should not slow W&B like uploading raw Prometheus metrics would.

Without PD, common perf/request/... metrics and available perf/decode/throughput/... metrics still exist. Detailed perf/prefill/... and perf/decode/...duration metrics only appear when SGLang returns the corresponding pd_* timing fields.

Where Prometheus Data Is Stored#

slime does not store per-second Prometheus data. SGLang / router only expose /metrics and /engine_metrics HTTP endpoints. Prometheus scrapes those endpoints periodically and stores the time series in Prometheus’s own TSDB.

That means:

  • If Prometheus is not running, serving metrics are only available from the current SGLang process memory and endpoint output, with no historical storage.

  • If Prometheus is running, history is stored under Prometheus’s --storage.tsdb.path.

  • slime does not upload these high-frequency metrics to W&B.

Useful SGLang metrics include:

sglang:num_queue_reqs
sglang:num_running_reqs
sglang:num_prefill_bootstrap_queue_reqs
sglang:num_prefill_inflight_queue_reqs
sglang:num_decode_prealloc_queue_reqs
sglang:num_decode_transfer_queue_reqs
sglang:kv_transfer_speed_gb_s_bucket
sglang:kv_transfer_latency_ms_bucket
sglang:kv_transfer_total_mb_bucket

These are useful in Prometheus / Grafana for live queue buildup, transfer speed, latency histograms, failure counters, and other serving-side symptoms.

Starting Prometheus#

Prometheus must run while training is running because it can only scrape endpoints that are currently alive. It does not need to run inside the training Python process; run it as a side process in the same machine or job.

A minimal config is:

global:
  scrape_interval: 10s

scrape_configs:
  - job_name: slime-sglang
    metrics_path: /engine_metrics
    static_configs:
      - targets:
          - "ROUTER_IP:ROUTER_PORT"

Replace ROUTER_IP:ROUTER_PORT with the router address printed by slime, or with the explicit --sglang-router-ip / --sglang-router-port values.

Start Prometheus with its TSDB path on persistent storage:

prometheus \
  --config.file=/path/to/prometheus.yml \
  --storage.tsdb.path=/path/to/prometheus-data \
  --storage.tsdb.retention.time=7d \
  --web.listen-address=0.0.0.0:9090

The slime image includes the prometheus binary, so this command can run directly inside the container. You can also start a side container from the same image as long as it can reach the router address and mounts /path/to/prometheus-data on persistent storage.

If --storage.tsdb.path points to container-local disk, the data is lost when the container is removed. If it points to NFS, a persistent volume, or a job output directory, you can restart Prometheus with the same TSDB directory after training and query the historical time range in the Prometheus UI or Grafana. This is time-series replay, not full per-request trace replay; per-sample request timings still come from sample traces / debug rollout data.

Trace Viewer#

Debug rollout dumps saved with --save-debug-rollout-data include sample traces. The trace viewer reads SGLang timing attrs directly from those traces and uses pd_* fields to render synthetic [P] / [D] lanes.

python tools/trace_timeline_viewer.py /path/to/debug/rollout_0.pt

The default path does not require separate ReqTimeStats(...) logs, Loki, or a compaction tool.