# Changelog

All notable changes to this project will be documented here.
The format is based on Keep a Changelog.

This project adheres to Semantic Versioning, with respect to the public API: the hyper-parameters, the experiment run function and the language model server API. Version numbers take the form `MAJOR.MINOR.PATCH`, where bumping each component has the following general meaning:
- `MAJOR`: There are backwards-incompatible changes to (a) the hyper-parameters themselves or how they are interpreted; (b) the run function; or (c) the language model server API.
- `MINOR`: New hyper-parameters are added in a backwards-compatible way, the run function is changed in a backwards-compatible way, or new language model server API endpoints are added. We may also bump the `MINOR` version on changes to the developer API.
- `PATCH`: A bug has been fixed.
Since the version number is stored with every tracked run, it is possible to compare the compatibility of two runs and to check whether an old run can be resumed with the current codebase. If the older run differs by a `MINOR` version, its hyper-parameters are guaranteed to be compatible, but not if it differs by a `MAJOR` version. Similarly, this ensures that a client and server which agree on the `MAJOR` version can communicate reliably.
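
As an illustration, here is a minimal sketch of the kind of compatibility check this enables (the function name and string handling are hypothetical, not part of the package's API):

```python
def can_resume(run_version: str, current_version: str) -> bool:
    """Check whether a tracked run can be resumed with the current codebase.

    Hyper-parameters are only guaranteed compatible when the MAJOR
    versions agree; MINOR and PATCH differences are backwards-compatible.
    """
    run_major = int(run_version.split(".")[0])
    current_major = int(current_version.split(".")[0])
    return run_major == current_major


assert can_resume("2.0.0", "2.1.1")      # MINOR/PATCH differ: compatible
assert not can_resume("1.0.0", "2.1.1")  # MAJOR differs: not guaranteed
```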

## [2.1.1] - 2025-06-30

### Fixed
- `VerifierDecisionSpectrumType` now contains the correct spectrum names.
- `nip.parameters.version.ConversionFunctionNotFoundError` called `super().__init` when it should be `super().__init__`.
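
For context, the bug class is the one sketched below (a simplified illustration; the message text is hypothetical): `super().__init` only looks up an attribute, and name mangling turns it into an `AttributeError` inside a class body, whereas `super().__init__(...)` actually calls the parent initialiser.

```python
class ConversionFunctionNotFoundError(Exception):
    """Simplified sketch of the corrected exception class."""

    def __init__(self, version: str):
        # Before the fix: `super().__init(...)` — inside a class body this
        # is name-mangled to `_ConversionFunctionNotFoundError__init` and
        # raises AttributeError instead of initialising the exception.
        super().__init__(f"No conversion function found for version {version!r}")
```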

## [2.1.0] - 2025-06-27

### Added
- Option to clear the Hugging Face model cache before starting the vLLM server, with `--vllm-clear-cache`.
- Option to set the max LoRA rank which can be served by vLLM.
- Ability to enable vLLM debug mode.
- Ability to set the vLLM max LoRA rank automatically.
- Agent-level hyper-parameter to enable quantisation for self-hosted models.
- Using the Liger kernel in DPO training for increased speed and lower memory usage.
- Now logging prompt and completion token lengths for self-hosted models.
- Now attempting several times to start the vLLM server.
- Allowed setting the DPO number of epochs.

### Changed
- Increased the default timeout for waiting for the vLLM server to load to 15 minutes.
- Using the experiment seed for training jobs.
- Using the model base name for training job names on LoRA models, rather than the model name itself.
- Launching the language model server with `uvicorn` directly (rather than using `fastapi run`), which has allowed displaying more log messages.
- Sanitising the model name a little more when creating the training job name, which helps Hugging Face identify the base model more easily.
- Allowed setting the DPO training logging period, and set it to 1 by default.
- The `cv_experiment.py` script now contains a Pydantic model of the default config values. Config files now only need to specify values which differ from the defaults (see the sketch after this list).
- Runs of the `cv_experiment.py` experiment script now include the name of the config file.
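
To illustrate the config-defaults pattern, here is a minimal sketch (the field names and defaults are hypothetical, not the actual options of `cv_experiment.py`):

```python
from pydantic import BaseModel


class ExperimentConfig(BaseModel):
    """Hypothetical defaults model: a config file only needs to
    override the fields whose values differ from these defaults."""

    seed: int = 0
    num_iterations: int = 10
    dpo_logging_period: int = 1


# A config file containing only `{"num_iterations": 20}` gives:
config = ExperimentConfig(num_iterations=20)
assert config.dpo_logging_period == 1  # default retained
```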

### Fixed
- Bug where non-trainable shared model groups were queried for training status when resubmitting a failed fine-tune job.
- Rearranged the trained model name so Hugging Face can always tell from the name what type of model it is.
- Made sure that LoRA layers are trainable.
- Limiting W&B job names to 128 characters.
- Fixed the run metadata file not being saved in the W&B artifact, meaning the run package version wasn’t being determined correctly.

## [2.0.0] - 2025-06-16

### Changed
- Renamed `ReinforcementLearningTrainer` to `TensorDictRlTrainer`.
- Refactored the agent-building part of the factory so that which parts to build is determined by class properties of the trainer classes, rather than by hard-coding the names of the trainers.
- Moved the `ScenarioInstance` dataclass into its own `scenario_instance` module.
- Refactored the code validation `RolloutAnalyser` class hierarchy.
- Implemented prompt template versions.
- Switched to using Jinja for prompt templates.
- Defaulting to not generating multiple responses for frozen agents in MALT (this is now configurable).
- The mid-point reward estimate for the verifier is now more comprehensive, taking into account the reward for not guessing.
- Renamed the `cv_benchmark.py` script to `benchmark_cv.py`.
- Renamed the `ei_cv.py` script to `cv_experiment.py`, and allowed it to run MALT experiments too.
- The solo agent trainer now uses the train and test splits provided by the dataset, rather than using its own.
- For the code validation scenario, the test dataset is split into “train” and “validation” sets, and testing can now be done on the validation set rather than the test dataset.
- When doing MALT, for testing we sample rollouts in the regular way, rather than constructing a MALT tree.
- When generating preference pairs for MALT, timesteps are now filtered based on whether an agent has taken an action, rather than on whether they are active according to the protocol handler. This is currently functionally equivalent, but could change in the future.
- The way MALT forests are stored as arrays has changed a little. We now explicitly store the parent ID of each node. The `has_positive_and_negative`, `is_positive`, `sampled_positive_example` and `sampled_negative_example` fields have been removed. Instead, the Boolean fields `is_pair_positive` and `is_pair_negative` record the nodes which form the positive-negative pairs for their parents. This is easier to work with and less brittle (see the sketch after this list). As a consequence, the stats logged when doing MALT are now completely different.
- Logging is now done consistently at the module level with `logging.getLogger(__name__)`, rather than on an ad hoc basis.
- Switched to using `asyncio` for pure-text rollout generation, rather than `multiprocessing.Pool`.
- Renamed `InvalidResponseError` to `UnparsableResponseError`.
- The `cv_experiment.py` script now takes a config file as an argument, which specifies the hyper-parameters of the experiment.
- The “vLLM-OpenAI” model provider has been renamed to “SelfHosted”, which corresponds to models hosted using the language model server, which uses vLLM for inference and Hugging Face for training.
- Switched to `uv` as the recommended package and project manager.
- The Dockerfile now uses CUDA version 12.0.
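
As a rough illustration of the forest layout described in the MALT storage entry above, here is a hedged sketch (the class, field and method names are illustrative, not the package's actual implementation):

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class MaltForestSketch:
    """Illustrative array-based MALT forest storage.

    Each node records its parent's ID explicitly, and the Boolean
    fields mark the nodes forming positive-negative pairs under
    their common parent.
    """

    parent_id: np.ndarray         # shape (num_nodes,), int; -1 for roots
    is_pair_positive: np.ndarray  # shape (num_nodes,), bool
    is_pair_negative: np.ndarray  # shape (num_nodes,), bool

    def pair_members(self, node_id: int) -> tuple[np.ndarray, np.ndarray]:
        """Positive and negative children paired under `node_id`."""
        children = np.flatnonzero(self.parent_id == node_id)
        return (
            children[self.is_pair_positive[children]],
            children[self.is_pair_negative[children]],
        )
```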

### Added
- A guide to creating a new trainer.
- An overview doc on how an experiment is built and run.
- Ability to use more models for code validation inference, using either vLLM or OpenRouter.
- Implemented `max_train_size` and `max_test_size` for code validation datasets.
- Allowed setting `repetition_penalty` for code validation agents.
- Logging the proportion of rollouts where the verifier does not make a decision, for pure-text trainers.
- Ability to specify a custom prompt template for the code validation task.
- Verifier format conformance rollout analyser: how well does the verifier conform to the required format?
- Utilities for downloading and loading checkpoints.
- The script `download_cv_checkpoints.py` to download code validation checkpoint files.
- A utility to compute the decision agreement between rollouts.
- Option to have the verifier give a decision on a spectrum, rather than a binary accept or reject.
- Utilities to compute the histogram of verifier decisions and thresholded performance (see the sketch after this list).
- Functionality for appending a ‘supervisor’ message to the chat history before sending it to the model, to help it better follow the instructions.
- Enabled analysing a batch of rollouts with `scripts/analyse_cv_rollouts.py`.
- A database of language model metadata. This helps with running a sweep of experiments across multiple models.
- The beta parameter can now be specified when doing DPO with the OpenAI API.
- Script and utility for visualising MALT rollouts as a forest of trees.
- The option to select response pairs for MALT by thresholding the difference in expected reward between the two.
- The option to run some rounds of expert iteration before MALT.
- The `test_dataset_split` hyper-parameter, which controls which dataset split is used for testing.
- Implemented `max_test_size` for graph isomorphism and image classification datasets.
- The `_PartialRolloutNode` instances generated when building a MALT tree can now be visualised.
- Enabled terminating an episode when a prover generates an invalid response, giving them a penalty.
- Ability to force a code validation experiment to run for more iterations than its original hyper-parameters specify.
- A self-hosted language model server and client. The server controls a vLLM process and a Hugging Face trainer, allowing convenient self-hosting of open-weight models.
- When doing pure-text RL, if any fine-tune job fails, the user is now asked whether they want to resubmit the job.
- The `--resume-if-safe` flag for `cv_experiment.py`, which automatically resumes a previous run if the major version numbers agree.
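
For the spectrum-decision utilities, here is a minimal sketch of thresholded performance (the decision scale and the function name are assumptions, not the package's actual API):

```python
import numpy as np


def thresholded_accuracy(
    decisions: np.ndarray, labels: np.ndarray, threshold: float
) -> float:
    """Binarise spectrum decisions (assumed to lie in [0, 1]) at
    `threshold` and measure agreement with true accept/reject labels."""
    accepted = decisions >= threshold
    return float(np.mean(accepted == labels))


decisions = np.array([0.1, 0.4, 0.8, 0.95])
labels = np.array([False, True, True, True])
print(thresholded_accuracy(decisions, labels, threshold=0.5))  # 0.75
```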

### Removed
- The `test_size` hyper-parameter, which is now unused.
- The `PureTextAgentParameters.vllm_openai_base_url` parameter, which is now specified by `PureTextAgentParameters.language_model_server_scheme_host` and `PureTextAgentParameters.vllm_server_port`.
- The `datasets` Dockerfile target, which pre-downloaded all datasets.

### Fixed
- Bug where `mean_decision` and `std_decision` were incorrectly logged for pure-text trainers.
- Bug where one of the provers in the MNIP protocol for code validation got the incorrect rewards, due to a mistake with inheritance.
- Design flaw where MALT preference pairs were not generated for the root node.

## [1.0.0] - 2025-03-10

First public release.