Customization Guide#

slime provides extensive customization capabilities through function path arguments. These allow you to inject custom logic at various stages of the training and rollout pipeline without modifying the core codebase.

Overview of Customization Interfaces#

Below is a summary of all available customization interfaces and their purposes.

| Argument | Purpose |
| --- | --- |
| --rollout-function-path | Override the entire rollout generation logic. |
| --custom-generate-function-path | Override only the generation step (e.g., for RAG or tool use). |
| --custom-rm-path | Implement custom reward computation logic. |
| --dynamic-sampling-filter-path | Filter samples during dynamic sampling (e.g., DAPO). |
| --buffer-filter-path | Filter samples in the rollout buffer before training. |
| --rollout-sample-filter-path | Determine if individual samples participate in loss calculation. |
| --rollout-all-samples-process-path | Process all samples (including filtered ones) after rollout. |
| --rollout-data-postprocess-path | Post-process rollout data after log probs are computed. |
| --custom-loss-function-path | Implement custom training loss computation. |
| --custom-tis-function-path | Implement custom importance sampling for off-policy correction. |
| --custom-pg-loss-reducer-function-path | Customize pg_loss reduction (e.g., for Dr.GRPO). |
| --custom-reward-post-process-path | Custom post-processing of rewards before advantage computation. |
| --custom-convert-samples-to-train-data-path | Override the conversion of samples to training data format. |
| --custom-rollout-log-function-path | Custom logging for training rollouts. |
| --custom-eval-rollout-log-function-path | Custom logging for evaluation rollouts. |
| --data-source-path | Override the data source for rollout prompts. |
| --eval-function-path | Override the rollout function specifically for evaluation. |
| --custom-megatron-init-path | Custom initialization after Megatron setup. |
| --custom-megatron-before-log-prob-hook-path | Custom logic before log probability computation. |
| --custom-megatron-before-train-step-hook-path | Custom logic before each training step. |
| --slime-router-middleware-paths | Add custom middleware to the slime router. |

Detailed Interface Reference#

1. Rollout Function (--rollout-function-path)#

Default: slime.rollout.sglang_rollout.generate_rollout

Purpose: Override the entire rollout generation logic.

Signature:

async def generate_rollout(args, rollout_id, *, evaluation=False) -> RolloutFnTrainOutput | RolloutFnEvalOutput

Use Cases:

  • Implementing complex multi-turn conversations

  • Adding custom sampling strategies

  • Integrating external tools or APIs during generation

Example: See examples/multi_agent/rollout_with_multi_agents.py


2. Custom Generate Function (--custom-generate-function-path)#

Default: None (uses built-in generate function)

Purpose: Override only the generation step within the default rollout function.

Signature:

async def custom_generate(args, sample: Sample, sampling_params: dict) -> Sample

Use Cases:

  • Implementing tool-calling or function-calling capabilities

  • Adding retrieval-augmented generation (RAG)

  • Multi-turn conversation handling

Example: See examples/search-r1/generate_with_search.py
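
For illustration, here is a minimal RAG-flavored sketch. The import path, the retrieve_context helper, and the sample.prompt / sample.response fields are assumptions for this sketch; see the referenced example for a working implementation.

from slime.utils.types import Sample  # assumed import path

async def retrieve_context(query: str) -> str:
    # Hypothetical retrieval backend; replace with your own service call.
    return "(retrieved passages)"

async def custom_generate(args, sample: Sample, sampling_params: dict) -> Sample:
    # Prepend retrieved context to the prompt before generation (RAG).
    context = await retrieve_context(sample.prompt)  # sample.prompt is an assumed field
    sample.prompt = f"Context:\n{context}\n\nQuestion:\n{sample.prompt}"
    # The generation call itself (e.g., an HTTP request to the inference
    # server) is elided; fill in sample.response and return the sample.
    return sample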


3. Reward Model (--custom-rm-path)#

Default: None (uses built-in reward models based on --rm-type)

Purpose: Implement custom reward computation logic.

Signature (single sample mode):

async def custom_rm(args, sample: Sample) -> float

Signature (batch mode, when --group-rm is enabled):

async def batched_custom_rm(args, samples: list[Sample]) -> list[float]

Use Cases:

  • Custom rule-based rewards

  • Integration with external reward model services

  • Multi-dimensional reward signals

Built-in Options (--rm-type):

  • math: Mathematical answer verification

  • dapo: DAPO-style scoring

  • deepscaler: DeepScaler rule-based reward

  • f1: F1 score computation

  • gpqa: GPQA reward computation

  • ifbench: IFBench reward computation

  • remote_rm: Remote reward model service (requires --rm-url)
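
As a sketch, a rule-based reward in single sample mode might look like the following. It assumes the ground-truth answer is stored on sample.label and the model output on sample.response; both field names and the import path are assumptions.

from slime.utils.types import Sample  # assumed import path

async def custom_rm(args, sample: Sample) -> float:
    # Exact-match rule: reward 1.0 when the final line of the response
    # equals the ground-truth label, else 0.0.
    lines = sample.response.strip().splitlines()
    answer = lines[-1] if lines else ""
    return 1.0 if answer == str(sample.label).strip() else 0.0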


4. Dynamic Sampling Filter (--dynamic-sampling-filter-path)#

Default: None

Purpose: Filter samples during dynamic sampling (e.g., DAPO-style filtering).

Signature:

def filter_function(args, samples: list[Sample], **kwargs) -> DynamicFilterOutput

Return Type:

@dataclass
class DynamicFilterOutput:
    keep: bool  # Whether to keep this sample group
    reason: str | None  # Reason for filtering (for logging)

Use Cases:

  • Filtering out samples where all responses have the same reward

  • Implementing curriculum learning strategies

  • Quality-based sample selection

Example: slime.rollout.filter_hub.dynamic_sampling_filters.check_reward_nonzero_std
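
A sketch in the spirit of the built-in check_reward_nonzero_std filter, assuming each Sample exposes a scalar reward attribute (the import paths are assumptions):

from slime.utils.types import Sample  # assumed import path
from slime.rollout.filter_hub.dynamic_sampling_filters import DynamicFilterOutput  # assumed import path

def filter_function(args, samples: list[Sample], **kwargs) -> DynamicFilterOutput:
    # Keep a group only when its rewards differ; a group with identical
    # rewards carries no advantage signal under group normalization.
    rewards = [sample.reward for sample in samples]  # sample.reward is assumed
    if max(rewards) == min(rewards):
        return DynamicFilterOutput(keep=False, reason="constant rewards in group")
    return DynamicFilterOutput(keep=True, reason=None)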


5. Buffer Filter (--buffer-filter-path)#

Default: None

Purpose: Filter samples in the rollout buffer before training.

Signature:

def buffer_filter(samples: list[list[Sample]]) -> list[list[Sample]]

Use Cases:

  • Removing low-quality samples before training

  • Implementing priority-based sample selection

  • Balancing sample distributions
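
A minimal quality-based sketch, assuming each Sample carries a scalar reward attribute and that a fixed threshold of 0.0 is a reasonable illustrative cutoff:

from slime.utils.types import Sample  # assumed import path

def buffer_filter(samples: list[list[Sample]]) -> list[list[Sample]]:
    # Keep only groups whose mean reward clears a (hypothetical) threshold.
    def mean_reward(group: list[Sample]) -> float:
        return sum(s.reward for s in group) / len(group)  # s.reward is assumed
    return [group for group in samples if mean_reward(group) > 0.0]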


6. Rollout Sample Filter (--rollout-sample-filter-path)#

Default: None

Purpose: Determine whether individual samples participate in loss calculation.

Signature:

def filter_function(args, samples: list[Sample]) -> None

Note: This function should directly modify the remove_sample attribute of each Sample object.

Use Cases:

  • Filtering samples based on response quality

  • Implementing selective training strategies
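
A minimal sketch that excludes degenerate responses from the loss. The sample.response field is an assumption; remove_sample is the documented attribute to set.

from slime.utils.types import Sample  # assumed import path

def filter_function(args, samples: list[Sample]) -> None:
    # Mark empty responses so they are excluded from loss calculation.
    for sample in samples:
        if not sample.response.strip():
            sample.remove_sample = True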


7. Rollout All Samples Process (--rollout-all-samples-process-path)#

Default: None

Purpose: Process all samples (including filtered ones) after rollout.

Signature:

def process_function(args, samples: list[list[Sample]]) -> None

Use Cases:

  • Logging and analysis of all generated samples

  • Computing statistics across filtered and kept samples


8. Rollout Data Postprocess (--rollout-data-postprocess-path)#

Default: None

Purpose: Post-process rollout data after log probabilities are computed.

Signature:

def postprocess_function(args, samples: list[list[Sample]]) -> None

Use Cases:

  • Updating loss masks based on computed values

  • Adding additional metadata to samples
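
As a sketch of the metadata use case, assuming each Sample exposes reward and metadata attributes (both names are assumptions):

from slime.utils.types import Sample  # assumed import path

def postprocess_function(args, samples: list[list[Sample]]) -> None:
    # Attach the group-level mean reward to each sample's metadata.
    for group in samples:
        group_mean = sum(s.reward for s in group) / len(group)
        for sample in group:
            sample.metadata = {**(sample.metadata or {}), "group_mean_reward": group_mean}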


9. Custom Loss Function (--custom-loss-function-path)#

Default: None (requires --loss-type custom_loss)

Purpose: Implement custom training loss computation.

Use Cases:

  • Novel RL objectives

  • Multi-objective optimization

  • Custom regularization terms


10. Custom TIS/RS Function (--custom-tis-function-path)#

Default: None

Purpose: Implement custom importance sampling for off-policy correction.

Use Cases:

  • Custom importance sampling ratio computation

  • Advanced off-policy correction methods

Example: examples/train_infer_mismatch_helper/mis.py:compute_mis_weights_with_cp


11. Custom pg_loss Reducer (--custom-pg-loss-reducer-function-path)#

Default: None

Purpose: Customize the reduction of pg_loss while other metrics (pg_clipfrac, ppo_kl, entropy_loss, etc.) still use the default sum_of_sample_mean.

Signature:

def get_pg_loss_reducer(
    total_lengths: list[int],
    response_lengths: list[int],
    loss_masks: list[torch.Tensor],
    calculate_per_token_loss: bool = False,
) -> Callable[[torch.Tensor], torch.Tensor]

Use Cases:

  • Dr.GRPO: Divide by a constant instead of effective token count

  • Custom loss normalization strategies

Example: examples/DrGRPO/custom_reducer.py:get_pg_loss_reducer
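
A Dr.GRPO-flavored sketch that normalizes by a fixed constant rather than the effective token count, assuming the reducer receives the per-token pg_loss with masking already applied; the constant is illustrative, and the referenced example is the real implementation:

from typing import Callable

import torch

def get_pg_loss_reducer(
    total_lengths: list[int],
    response_lengths: list[int],
    loss_masks: list[torch.Tensor],
    calculate_per_token_loss: bool = False,
) -> Callable[[torch.Tensor], torch.Tensor]:
    # Dr.GRPO-style: divide the summed token loss by a fixed budget
    # instead of the number of unmasked tokens.
    MAX_RESPONSE_TOKENS = 8192  # hypothetical constant generation budget

    def reducer(pg_loss: torch.Tensor) -> torch.Tensor:
        return pg_loss.sum() / (len(response_lengths) * MAX_RESPONSE_TOKENS)

    return reducer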


12. Reward Post-Processing (--custom-reward-post-process-path)#

Default: None (uses default GRPO normalization)

Purpose: Custom post-processing of rewards before advantage computation.

Use Cases:

  • Custom reward normalization strategies

  • Reward shaping


13. Samples to Train Data Conversion (--custom-convert-samples-to-train-data-path)#

Default: None (uses built-in conversion logic)

Purpose: Override the conversion of samples to training data format.

Signature:

def convert_samples_to_train_data(
    args,
    samples: list[Sample] | list[list[Sample]],
) -> dict

Return Type:

dict: {
    "tokens": list[list[int]],           # Token IDs for each sample
    "response_lengths": list[int],        # Response lengths
    "rewards": list[float],               # Normalized rewards
    "raw_reward": list[float],            # Raw rewards
    "truncated": list[int],               # Truncation flags (0 or 1)
    "sample_indices": list[int],          # Sample indices
    "loss_masks": list[list[int]],        # Loss masks for each sample
    # Optional fields:
    "round_number": list[int],            # Round numbers (for rollout buffer)
    "rollout_log_probs": list,            # Log probs (for off-policy correction)
    "rollout_routed_experts": list,       # Routed experts (for MoE)
    "metadata": list,                     # Train metadata
    "multimodal_train_inputs": list,      # Multimodal tensors (for VLM)
    "teacher_log_probs": list,            # Teacher log probs (for distillation)
}

Use Cases:

  • Handling list[list[Sample]] inputs

  • Custom data format requirements for training
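
A minimal sketch for the flat list[Sample] case, producing only the required keys. The Sample field names (tokens, response_length, reward, raw_reward, index, loss_mask) are assumptions mirroring the keys documented above.

from slime.utils.types import Sample  # assumed import path

def convert_samples_to_train_data(args, samples: list[Sample]) -> dict:
    # Handles flat list[Sample] input only; extend for list[list[Sample]].
    return {
        "tokens": [s.tokens for s in samples],
        "response_lengths": [s.response_length for s in samples],
        "rewards": [s.reward for s in samples],
        "raw_reward": [s.raw_reward for s in samples],
        "truncated": [int(getattr(s, "truncated", 0)) for s in samples],
        "sample_indices": [s.index for s in samples],
        "loss_masks": [s.loss_mask for s in samples],
    }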


14. Logging Functions#

Training Rollout Logging (--custom-rollout-log-function-path)#

Signature:

def log_rollout_data(rollout_id, args, samples, rollout_extra_metrics, rollout_time) -> bool

Return: True to skip default logging, False to continue with default logging.
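
A sketch of the training-side hook that prints a one-line summary and then falls through to the default logging. The grouped structure of samples and the reward field are assumptions.

def log_rollout_data(rollout_id, args, samples, rollout_extra_metrics, rollout_time) -> bool:
    # Flatten grouped samples and report the mean reward for this rollout.
    flat = [s for group in samples for s in group]
    mean_reward = sum(s.reward for s in flat) / max(len(flat), 1)
    print(f"[rollout {rollout_id}] samples={len(flat)} "
          f"mean_reward={mean_reward:.3f} time={rollout_time:.1f}s")
    return False  # False: default logging still runs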

Evaluation Rollout Logging (--custom-eval-rollout-log-function-path)#

Signature:

def log_eval_rollout_data(rollout_id, args, data, extra_metrics) -> bool

Return: True to skip default logging, False to continue with default logging.


15. Data Source (--data-source-path)#

Default: slime.rollout.data_source.RolloutDataSourceWithBuffer

Purpose: Override the data source for rollout prompts.

Base Class: slime.rollout.data_source.DataSource

Required Methods:

class CustomDataSource(DataSource):
    def get_samples(self, num_samples: int) -> list[list[Sample]]:
        """Return num_samples sample groups."""

    def add_samples(self, samples: list[list[Sample]]):
        """Add sample groups back to the data source."""

    def save(self, rollout_id):
        """Save state for checkpointing."""

    def load(self, rollout_id=None):
        """Load state from a checkpoint."""

16. Evaluation Function (--eval-function-path)#

Default: Same as --rollout-function-path

Purpose: Override the rollout function specifically for evaluation.

Use Cases:

  • Different sampling parameters for evaluation

  • Evaluation-specific logic


17. Megatron Hooks#

Megatron Initialization (--custom-megatron-init-path)#

Signature:

def custom_init(args) -> None

Purpose: Custom initialization after Megatron setup.

Before Log Prob Hook (--custom-megatron-before-log-prob-hook-path)#

Signature:

def custom_hook(args, model, store_prefix) -> None

Purpose: Custom logic before log probability computation.

Before Train Step Hook (--custom-megatron-before-train-step-hook-path)#

Signature:

def custom_hook(args, rollout_id, step_id, model, optimizer, opt_param_scheduler) -> None

Purpose: Custom logic before each training step.
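
As a sketch of the before-train-step hook, any of the passed-in handles (model, optimizer, opt_param_scheduler) can be inspected; here it simply announces the first step of each rollout:

def custom_hook(args, rollout_id, step_id, model, optimizer, opt_param_scheduler) -> None:
    # Runs before every training step; step_id counts steps within a rollout.
    if step_id == 0:
        print(f"[rollout {rollout_id}] beginning training steps")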


18. slime Router Middleware (--slime-router-middleware-paths)#

Purpose: Add custom middleware to the slime router for request processing.

Use Cases:

  • Request/response transformation

  • Custom routing logic

  • Caching and optimization


19. MoE Routing Replay#

Stabilize MoE RL training by recording and replaying expert routing decisions to ensure consistency.

| Argument | Description |
| --- | --- |
| --use-routing-replay | Enforce forward-backward routing consistency during training (arXiv:2507.18071). |
| --use-rollout-routing-replay | R3: replay routing decisions from rollout during training; requires --use-slime-router (arXiv:2510.11370). |