Running Experiments (nip.run)#
Basic Workflow#
An experiment is built and run using the run_experiment() function. This takes as input a HyperParameters object, as well as various configuration options. The basic workflow is as follows:
1. Create a HyperParameters object. This specifies all the parameters for the experiment. In theory an experiment should be completely reproducible from its hyper-parameters (in practice, things like hardware quirks prevent this).
2. Call run_experiment() with the hyper-parameters object and other configuration options. These options specify things like the device to run on, and whether to save the results to Weights & Biases. These additional options should not affect the experiment’s outcome (in theory).
The run_experiment()
function takes care of setting up
all the experiment components, running the experiment, and saving the results. It is
designed to be as simple as possible, while still allowing for a wide range of
experiments.
See the Running Experiments guide for a more detailed walkthrough.
Example#
Run a graph isomorphism experiment using PPO, with a few custom parameters:
from nip import run_experiment
from nip.parameters import HyperParameters, AgentsParams, GraphIsomorphismAgentParameters

hyper_params = HyperParameters(
    scenario="graph_isomorphism",
    trainer="ppo",
    dataset="eru10000",
    agents=AgentsParams(
        prover=GraphIsomorphismAgentParameters(d_gnn=128),
        verifier=GraphIsomorphismAgentParameters(num_gnn_layers=2),
    ),
)

run_experiment(hyper_params, device="cuda", use_wandb=True, run_id="my_run")
Preparing Experiments#
If you are running multiple experiments (e.g. with a hyper-parameter sweep), it can be convenient to do some preparation in advance, such as downloading datasets. This is especially important if experiments are run in parallel, as downloading the same dataset multiple times can be slow, wasteful, and potentially lead to errors.
The prepare_experiment()
function is designed to
help with this. It takes a HyperParameters
object and simulates building all experiment components, without actually running the
experiment. It also returns some information about the experiment, such as the total
number of steps taken by the trainer (useful for progress bars).
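A minimal sketch of this preparation step, using the same hyper-parameters as the example above. The import path follows the nip.run reference below; the `total_num_iterations` attribute name is an assumption, since only the return type PreparedExperimentInfo is documented here.

```python
from nip.parameters import HyperParameters
from nip.run import prepare_experiment

# Build the hyper-parameters once, exactly as you would for run_experiment()
hyper_params = HyperParameters(
    scenario="graph_isomorphism",
    trainer="ppo",
    dataset="eru10000",
)

# Download and cache the dataset and simulate building the experiment
# components, before launching any parallel runs
prepared_info = prepare_experiment(hyper_params, num_dataset_threads=8)

# The returned PreparedExperimentInfo can drive a progress bar;
# `total_num_iterations` is a hypothetical attribute name
# total_steps = prepared_info.total_num_iterations
```

Running this once before a sweep means each parallel worker finds the dataset already cached, avoiding the duplicate-download races described above.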
Module Contents#
- nip.run.run_experiment(hyper_params: HyperParameters, device: torch.device | str | int = 'cpu', logger: logging.Logger | logging.LoggerAdapter | None = None, profiler: torch.profiler.profile | None = None, tqdm_func: Callable = tqdm, ignore_cache: bool = False, use_wandb: bool = False, wandb_project: str | None = None, wandb_entity: str | None = None, run_id: str | None = None, allow_auto_generated_run_id: bool = False, allow_resuming_wandb_run: bool = False, allow_overriding_wandb_config: bool = False, print_wandb_run_url: bool = False, wandb_tags: list = [], wandb_group: str | None = None, num_dataset_threads: int = 8, num_rollout_workers: int = 4, pin_memory: bool = True, dataset_on_device: bool = False, enable_efficient_attention: bool = False, global_tqdm_step_fn: Callable = <lambda>, test_run: bool = False)[source]#
Build and run an experiment.
Builds the experiment components according to the parameters and runs the experiment.
- Parameters:
hyper_params (HyperParameters) – The parameters of the experiment. Note that the actual parameters used in the experiment may differ from hyper_params if a base run is used.
device (TorchDevice, default="cpu") – The device to use for training.
logger (logging.Logger | logging.LoggerAdapter, optional) – The logger to log to. If None, the trainer will create a logger.
profiler (torch.profiler.profile, optional) – The PyTorch profiler being used to profile the training, if any.
tqdm_func (Callable, optional) – The tqdm function to use. Defaults to tqdm.
ignore_cache (bool, default=False) – If True, the dataset and model cache are ignored and rebuilt.
use_wandb (bool, default=False) – If True, log the experiment to Weights & Biases.
wandb_project (str, optional) – The name of the W&B project to log to. If None, the default project is used.
wandb_entity (str, optional) – The name of the W&B entity to log to. If None, the default entity is used.
run_id (str, optional) – The ID of the run. Required if use_wandb is True and allow_auto_generated_run_id is False.
allow_auto_generated_run_id (bool, default=False) – If True, the run ID can be auto-generated if not specified.
allow_resuming_wandb_run (bool, default=False) – If True, the run can be resumed if the run ID is specified and the run exists.
allow_overriding_wandb_config (bool, default=False) – If True, the W&B config can be overridden when resuming a run.
print_wandb_run_url (bool, default=False) – If True, print the URL of the W&B run at the start of the experiment.
wandb_tags (list[str], default=[]) – The tags to add to the W&B run.
wandb_group (str, optional) – The name of the W&B group for the run. Runs with the same group are placed together in the UI. This is useful for doing multiple runs on the same machine.
num_dataset_threads (int, default=8) – The number of threads to use for saving the memory-mapped tensordict.
num_rollout_workers (int, default=4) – The number of workers to use for collecting rollout samples, when this is done in parallel.
pin_memory (bool, default=True) – Whether to pin the memory of the tensors in the dataloader, and move them to the GPU with non_blocking=True. This can speed up training.
dataset_on_device (bool, default=False) – Whether to store the whole dataset on the device. This can speed up training but requires that the dataset fits on the device. This makes pin_memory redundant.
enable_efficient_attention (bool, default=False) – Whether to enable the ‘Memory-Efficient Attention’ backend for scaled dot-product attention. There may be a bug in this implementation which causes NaNs to appear in the backward pass. See pytorch/pytorch#119320 for more information.
global_tqdm_step_fn (Callable, default=lambda: ...) – A function to step the global tqdm progress bar. This is used when there are multiple processes running in parallel and each process needs to update the global progress bar.
test_run (bool, default=False) – If True, the experiment is run in test mode. This means we do the smallest number of iterations possible and then exit. This is useful for testing that the experiment runs without errors.
- nip.run.prepare_experiment(hyper_params: HyperParameters, profiler: torch.profiler.profile | None = None, ignore_cache: bool = False, num_dataset_threads: int = 8, device: torch.device | str | int | None = None, test_run: bool = False) → PreparedExperimentInfo[source]#
Prepare for running an experiment.
This is useful, e.g., for downloading data before running an experiment. Without this, if running multiple experiments in parallel, the initial runs will all start downloading data at the same time, which can cause problems.
- Parameters:
hyper_params (HyperParameters) – The parameters of the experiment.
profiler (torch.profiler.profile, optional) – The PyTorch profiler being used to profile the training, if any.
ignore_cache (bool, default=False) – If True, when the dataset is loaded, the cache is ignored and the dataset is rebuilt from the raw data.
num_dataset_threads (int, default=8) – The number of threads to use for saving the memory-mapped tensordict.
device (TorchDevice, optional) – The device to use for training. If None, the GPU is used if available, otherwise the CPU is used.
test_run (bool, default=False) – If True, the experiment is run in test mode. This means we do the smallest number of iterations possible and then exit. This is useful for testing that the experiment runs without errors.
- Returns:
prepared_experiment_info (PreparedExperimentInfo) – Information about the prepared experiment.