# Changelog

All notable changes to this project will be documented here.
The format is based on Keep a Changelog.

This project adheres to Semantic Versioning, with respect to the public API: the hyper-parameters, the experiment run function and the language model server API. Version numbers take the form `MAJOR.MINOR.PATCH`, where bumping each component has the following general meaning:
- `MAJOR`: There are backwards-incompatible changes to (a) the hyper-parameters themselves or how they are interpreted; (b) the run function; or (c) the language model server API.
- `MINOR`: New hyper-parameters are added in a backwards-compatible way, the run function is changed in a backwards-compatible way, or new language model server API endpoints are added. We may also bump the `MINOR` version on changes to the developer API.
- `PATCH`: A bug has been fixed.
Since the version number is stored with every tracked run, it is possible to compare the compatibility of two runs and to check whether an old run can be resumed with the current codebase. If the older run differs by a `MINOR` version, its hyper-parameters are guaranteed to be compatible, but not if it differs by a `MAJOR` version. Similarly, this ensures that a client and server which agree on the `MAJOR` version can communicate reliably.
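
As an illustration, here is a minimal sketch of the kind of compatibility check this enables (the function name and string handling are hypothetical, not part of the package's API):

```python
def can_resume(run_version: str, current_version: str) -> bool:
    """Check whether a tracked run can be resumed with the current codebase.

    Hyper-parameters are only guaranteed compatible when the MAJOR
    versions agree; MINOR and PATCH differences are backwards-compatible.
    """
    run_major = int(run_version.split(".")[0])
    current_major = int(current_version.split(".")[0])
    return run_major == current_major


assert can_resume("2.0.0", "2.1.1")      # MINOR/PATCH differ: compatible
assert not can_resume("1.0.0", "2.1.1")  # MAJOR differs: not guaranteed
```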

## [2.1.1] - 2025-06-30

### Fixed
- `VerifierDecisionSpectrumType` now contains the correct spectrum names.
- `nip.parameters.version.ConversionFunctionNotFoundError` called `super().__init` when it should be `super().__init__`.
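
For context, the bug class is the one sketched below (a simplified illustration; the message text is hypothetical): `super().__init` only looks up an attribute, and name mangling turns it into an `AttributeError` inside a class body, whereas `super().__init__(...)` actually calls the parent initialiser.

```python
class ConversionFunctionNotFoundError(Exception):
    """Simplified sketch of the corrected exception class."""

    def __init__(self, version: str):
        # Before the fix: `super().__init(...)` — inside a class body this
        # is name-mangled to `_ConversionFunctionNotFoundError__init` and
        # raises AttributeError instead of initialising the exception.
        super().__init__(f"No conversion function found for version {version!r}")
```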

## [2.1.0] - 2025-06-27

### Added
- Option to clear the Hugging Face model cache before starting the vLLM server, with `--vllm-clear-cache`.
- Option to set the max LoRA rank which can be served by vLLM.
- Ability to enable vLLM debug mode.
- Ability to set the vLLM max LoRA rank automatically.
- Agent-level hyper-parameter to enable quantisation for self-hosted models.
- Using the Liger kernel in DPO training for increased speed and lower memory usage.
- Now logging prompt and completion token lengths for self-hosted models.
- Now attempting several times to start the vLLM server.
- Allowed setting the DPO number of epochs.

### Changed
- Increased the default timeout for waiting for the vLLM server to load to 15 minutes.
- Using the experiment seed for training jobs.
- Using the model base name for training job names on LoRA models, rather than the model name itself.
- Launching the language model server with `uvicorn` directly (rather than using `fastapi run`), which has allowed displaying more log messages.
- Sanitising the model name a little more when creating the training job name, which helps Hugging Face identify the base model more easily.
- Allowed setting the DPO training logging period, and set it to 1 by default.
- The `cv_experiment.py` script now contains a Pydantic model of the default config values. Config files now only need to specify values which differ from the defaults (see the sketch after this list).
- Runs of the `cv_experiment.py` experiment script now include the name of the config file.
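
To illustrate the config-defaults pattern, here is a minimal sketch (the field names and defaults are hypothetical, not the actual options of `cv_experiment.py`):

```python
from pydantic import BaseModel


class ExperimentConfig(BaseModel):
    """Hypothetical defaults model: a config file only needs to
    override the fields whose values differ from these defaults."""

    seed: int = 0
    num_iterations: int = 10
    dpo_logging_period: int = 1


# A config file containing only `{"num_iterations": 20}` gives:
config = ExperimentConfig(num_iterations=20)
assert config.dpo_logging_period == 1  # default retained
```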

### Fixed
- Bug where non-trainable shared model groups were queried for training status when resubmitting a failed fine-tune job.
- Rearranged the trained model name so Hugging Face can always tell from the name what type of model it is.
- Made sure that LoRA layers are trainable.
- Limiting W&B job names to 128 characters.
- Fixed the run metadata file not being saved in the W&B artifact, meaning the run package version wasn’t being determined correctly.

## [2.0.0] - 2025-06-16

### Changed
- Renamed `ReinforcementLearningTrainer` to `TensorDictRlTrainer`.
- Refactored the agent-building part of the factory so that which parts to build is determined by class properties of the trainer classes, rather than by hard-coding the names of the trainers.
- Moved the `ScenarioInstance` dataclass into its own `scenario_instance` module.
- Refactored the code validation `RolloutAnalyser` class hierarchy.
- Implemented prompt template versions.
- Switched to using Jinja for prompt templates.
- Defaulting to not generating multiple responses for frozen agents in MALT (this is now configurable).
- The mid-point reward estimate for the verifier is now more comprehensive, taking into account the reward for not guessing.
- Renamed the `cv_benchmark.py` script to `benchmark_cv.py`.
- Renamed the `ei_cv.py` script to `cv_experiment.py`, and allowed it to run MALT experiments too.
- The solo agent trainer now uses the train and test splits provided by the dataset, rather than using its own.
- For the code validation scenario, the test dataset is split into “train” and “validation” sets, and testing can now be done on the validation set rather than the test dataset.
- When doing MALT, for testing we sample rollouts in the regular way, rather than constructing a MALT tree.
- When generating preference pairs for MALT, timesteps are now filtered based on whether an agent has taken an action, rather than on whether they are active according to the protocol handler. This is currently functionally equivalent, but could change in the future.
- The way MALT forests are stored as arrays has changed a little. We now explicitly store the parent ID of each node. The `has_positive_and_negative`, `is_positive`, `sampled_positive_example` and `sampled_negative_example` fields have been removed. Instead, the Boolean fields `is_pair_positive` and `is_pair_negative` record the nodes which form the positive-negative pairs for their parents. This is easier to work with and less brittle (see the sketch after this list). As a consequence, the stats logged when doing MALT are now completely different.
- Logging is now done consistently at the module level with `logging.getLogger(__name__)`, rather than on an ad hoc basis.
- Switched to using `asyncio` for pure-text rollout generation, rather than `multiprocessing.Pool`.
- Renamed `InvalidResponseError` to `UnparsableResponseError`.
- The `cv_experiment.py` script now takes a config file as an argument, which specifies the hyper-parameters of the experiment.
- The “vLLM-OpenAI” model provider has been renamed to “SelfHosted”, which corresponds to models hosted using the language model server, which uses vLLM for inference and Hugging Face for training.
- Switched to `uv` as the recommended package and project manager.
- The Dockerfile now uses CUDA version 12.0.
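
As a rough illustration of the forest layout described in the MALT storage entry above, here is a hedged sketch (the class, field and method names are illustrative, not the package's actual implementation):

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class MaltForestSketch:
    """Illustrative array-based MALT forest storage.

    Each node records its parent's ID explicitly, and the Boolean
    fields mark the nodes forming positive-negative pairs under
    their common parent.
    """

    parent_id: np.ndarray         # shape (num_nodes,), int; -1 for roots
    is_pair_positive: np.ndarray  # shape (num_nodes,), bool
    is_pair_negative: np.ndarray  # shape (num_nodes,), bool

    def pair_members(self, node_id: int) -> tuple[np.ndarray, np.ndarray]:
        """Positive and negative children paired under `node_id`."""
        children = np.flatnonzero(self.parent_id == node_id)
        return (
            children[self.is_pair_positive[children]],
            children[self.is_pair_negative[children]],
        )
```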

### Added
- A guide to creating a new trainer.
- An overview doc on how an experiment is built and run.
- Ability to use more models for code validation inference, using either vLLM or OpenRouter.
- Implemented `max_train_size` and `max_test_size` for code validation datasets.
- Allowed setting `repetition_penalty` for code validation agents.
- Logging the proportion of rollouts where the verifier does not make a decision, for pure-text trainers.
- Ability to specify a custom prompt template for the code validation task.
- Verifier format conformance rollout analyser: how well does the verifier conform to the required format?
- Utilities for downloading and loading checkpoints.
- The script `download_cv_checkpoints.py` to download code validation checkpoint files.
- A utility to compute the decision agreement between rollouts.
- Option to have the verifier give a decision on a spectrum, rather than a binary accept or reject.
- Utilities to compute the histogram of verifier decisions and thresholded performance (see the sketch after this list).
- Functionality for appending a ‘supervisor’ message to the chat history before sending it to the model, to help it better follow the instructions.
- Enabled analysing a batch of rollouts with `scripts/analyse_cv_rollouts.py`.
- A database of language model metadata. This helps with running a sweep of experiments across multiple models.
- The beta parameter can now be specified when doing DPO with the OpenAI API.
- Script and utility for visualising MALT rollouts as a forest of trees.
- The option to select response pairs for MALT by thresholding the difference in expected reward between the two.
- The option to run some rounds of expert iteration before MALT.
- The `test_dataset_split` hyper-parameter, which controls which dataset split is used for testing.
- Implemented `max_test_size` for graph isomorphism and image classification datasets.
- The `_PartialRolloutNode` instances generated when building a MALT tree can now be visualised.
- Enabled terminating an episode when a prover generates an invalid response, giving them a penalty.
- Ability to force a code validation experiment to run for more iterations than its original hyper-parameters specify.
- A self-hosted language model server and client. The server controls a vLLM process and a Hugging Face trainer, allowing convenient self-hosting of open-weight models.
- When doing pure-text RL, if any fine-tune job fails, the user is now asked whether they want to resubmit the job.
- The `--resume-if-safe` flag for `cv_experiment.py`, which automatically resumes a previous run if the major version numbers agree.
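
For the spectrum-decision utilities, here is a minimal sketch of thresholded performance (the decision scale and the function name are assumptions, not the package's actual API):

```python
import numpy as np


def thresholded_accuracy(
    decisions: np.ndarray, labels: np.ndarray, threshold: float
) -> float:
    """Binarise spectrum decisions (assumed to lie in [0, 1]) at
    `threshold` and measure agreement with true accept/reject labels."""
    accepted = decisions >= threshold
    return float(np.mean(accepted == labels))


decisions = np.array([0.1, 0.4, 0.8, 0.95])
labels = np.array([False, True, True, True])
print(thresholded_accuracy(decisions, labels, threshold=0.5))  # 0.75
```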

### Removed
- The `test_size` hyper-parameter, which is now unused.
- The `PureTextAgentParameters.vllm_openai_base_url` parameter, which is now specified by `PureTextAgentParameters.language_model_server_scheme_host` and `PureTextAgentParameters.vllm_server_port`.
- The `datasets` Dockerfile target, which pre-downloaded all datasets.

### Fixed
- Bug where `mean_decision` and `std_decision` were incorrectly logged for pure-text trainers.
- Bug where one of the provers in the MNIP protocol for code validation got the incorrect rewards, due to a mistake with inheritance.
- Design flaw where MALT preference pairs were not generated for the root node.

## [1.0.0] - 2025-03-10

First public release.