nip.rl_objectives.SpgLoss#
- class nip.rl_objectives.SpgLoss(*args, **kwargs)[source]#
Loss for Stackelberg Policy Gradient and several variants.
In contrast to other objectives, the forward method returns the gains per agent and the sum of the log probabilities separately. These must be combined later to compute the true loss, because the gradients of these two quantities need to be computed separately (see the usage sketch after the list of variants below). The following variants are supported:
SPG: Standard Stackelberg Policy Gradient [FCR20].
PSPG: SPG with the clipped PPO loss.
LOLA: The Learning with Opponent-Learning Awareness algorithm [FCAS+18].
POLA: LOLA with the clipped PPO loss.
SOS: The Stable Opponent Shaping algorithm [LFB+19].
PSOS: SOS with the clipped PPO loss.
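The exact training loop lives elsewhere in the package, but a minimal usage sketch of how the pieces fit together might look like this (the names loss_module, optimisers, and batch are illustrative):

```python
def training_step(loss_module, optimisers, batch):
    # loss_module is assumed to be an SpgLoss instance and optimisers a dict
    # mapping agent names to their optimisers. forward returns the per-agent
    # gains and log-probability sums; backward combines them and assigns the
    # gradients to each agent's parameters.
    loss_vals = loss_module(batch)      # SpgLoss.forward
    loss_module.backward(loss_vals)     # SpgLoss.backward
    for optimiser in optimisers.values():
        optimiser.step()
        optimiser.zero_grad()
    return loss_vals
```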
Methods Summary
__init__(actor, critic, variant, ...[, ...]): Initialize internal Module state, shared by both nn.Module and ScriptModule.
_compute_ihvps(loss_vals): Compute the inverse Hessian vector products for each agent and follower.
_compute_jacobian_terms(leader_name, ...): Compute the score and policy gradient coefficients for the Jacobian terms.
_get_actor_params(): Get the parameters for each agent as dictionaries with the parameter names.
_get_advantage(tensordict): Get the advantage for a tensordict, normalising it if required.
_get_and_zero_all_grads(): Get and zero the gradients for the parameters of all the agents.
_get_followers(): Get dictionaries of the followers of each agent.
_log_weight(sample): Compute the log weight for the given TensorDict sample.
_loss_critic(tensordict): Get the critic loss without the clip fraction.
_set_entropy_and_critic_losses(tensordict, ...): Set the entropy and critic losses in the output TensorDict.
_set_ess(num_batch_dims, td_out, log_weight): Set the ESS in the output TensorDict, for logging.
backward(loss_vals): Compute and assign the gradients of the loss for each agent.
forward(tensordict): Compute the loss for the Stackelberg Policy Gradient algorithm.
set_keys(**kwargs): Set the keys of the input TensorDict that are used by this loss.
Attributes
SEP
TARGET_NET_WARNING
T_destination
_cached_critic_network_params_detached
_clip_bounds
action_keys
call_super_init
default_keys
default_value_estimator
dump_patches
functional
in_keys
out_keys
out_keys_source
stackelberg_sequence_flat
A flattened version of the Stackelberg sequence.
tensor_keys
value_estimator
The value function blends in the reward and value estimate(s) from upcoming state(s)/state-action pair(s) into a target value estimate for the value network.
vmap_randomness
training
Methods
- __init__(actor: TensorDictModule, critic: TensorDictModule, variant: Literal['spg', 'pspg', 'lola', 'pola', 'sos', 'psos'], stackelberg_sequence: list[tuple[str, ...]], agent_names: list[str], agents: dict[str, Agent], ihvp_arguments: dict, additional_lola_term: bool, sos_scaling_factor: float, sos_threshold_factor: float, agent_lr_factors: dict[str, LrFactors | dict | None], lr: float, clip_epsilon: float, entropy_coef: float, normalize_advantage: bool, loss_critic_type: str, clip_value: bool | float | None, device: device, functional: bool = True)[source]#
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- _compute_ihvps(loss_vals: TensorDictBase) dict[tuple[str, str], dict[str, Tensor]] [source]#
Compute the inverse Hessian vector products for each agent and follower.
This is the inverse Hessian of the follower’s loss with respect to the follower’s parameters multiplied by the gradient of the leader’s loss with respect to the follower’s parameters.
- Parameters:
loss_vals (TensorDictBase) – The loss values.
- Returns:
ihvps (dict[tuple[str, str], dict[str, Tensor]]) – A dictionary where the keys are the agent and follower names, and the values are dictionaries of the inverse Hessian vector products for each of the follower’s parameters.
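As a rough illustration of what an inverse Hessian vector product involves, the sketch below approximates one with the conjugate gradient method, using Hessian-vector products from torch.autograd. The helper and its arguments are illustrative; the actual scheme is configured via ihvp_arguments and may differ.

```python
import torch


def conjugate_gradient_ihvp(follower_loss, follower_params, vector, num_iters=10, damping=1e-3):
    """Approximately solve (H + damping * I) x = vector, where H is the Hessian of
    follower_loss w.r.t. follower_params (a list of tensors) and vector is a flat
    tensor, e.g. the leader's gradient w.r.t. the follower's parameters."""
    grads = torch.autograd.grad(follower_loss, follower_params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])

    def hvp(x):
        # Hessian-vector product via double backward, plus damping for stability.
        prods = torch.autograd.grad(flat_grad @ x, follower_params, retain_graph=True)
        return torch.cat([p.reshape(-1) for p in prods]) + damping * x

    x = torch.zeros_like(vector)
    residual = vector.clone()
    direction = residual.clone()
    rs_old = residual @ residual
    for _ in range(num_iters):
        h_dir = hvp(direction)
        alpha = rs_old / (direction @ h_dir)
        x = x + alpha * direction
        residual = residual - alpha * h_dir
        rs_new = residual @ residual
        if rs_new.sqrt() < 1e-8:
            break
        direction = residual + (rs_new / rs_old) * direction
        rs_old = rs_new
    return x
```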
- _compute_jacobian_terms(leader_name: str, follower_name: str, objective_loss_grads: dict[tuple[str, str], dict[str, Tensor]], scores: dict[str, dict[str, Tensor]]) tuple[dict[str, Tensor], dict[str, Tensor]] [source]#
Compute the score and policy gradient coefficients for the Jacobian terms.
Recursive function to compute elements of the Jacobian (of agent 2’s loss with respect to agent 2’s parameters then agent 1’s parameters) using the chain rule – we maintain separate coefficients for the score term and the policy gradient term to avoid computing full Jacobian matrices.
Consider a leader agent \(g\) and a follower agent \(f\). Let \(\nabla_j \Ell_i\) be the gradient of the loss of agent \(i\) with respect to the parameters of agent \(j\), and let \(\bar S_i\) be the score for agent \(i\), i.e. the gradient of the sum of the log probabilities of the actions with respect to the parameters of agent \(i\). This function computes \(C_S(g, f)\) and \(C_{PG}(g, f)\), the coefficients for the score and policy gradient terms in the Jacobian of agent \(f\)’s loss with respect to agent \(g\)’s parameters.
When \(f\) follows \(g\) immediately in the Stackelberg sequence, the score coefficient is:
\[C_S(g, f) = \nabla_g \Ell_f\]
and the policy gradient coefficient is:
\[C_{PG}(g, f) = \bar S_g\]
Otherwise, we recursively compute the Jacobian terms for the leader and each immediate leader \(g'\) of \(f\). Let \(Q\) be the set of immediate leaders of \(f\). Then the score coefficient is:
\[C_S(g, f) = \sum_{g' \in Q} \left( (\nabla_{g'} \Ell_f \cdot \bar S_{g'}) \, C_S(g', f) + (\nabla_{g'} \Ell_f \cdot \nabla_{g'} \Ell_{g'}) \, C_{PG}(g', f) \right)\]
- Parameters:
leader_name (str) – The name of the leader agent, which comes before the follower agent in the stackelberg_sequence.
follower_name (str) – The name of the follower agent.
objective_loss_grads (dict[tuple[str, str], dict[str, Tensor]]) – The gradients of the objective loss for each agent with respect to each agent’s parameters. The first index is the agent whose loss it is, and the second index is the agent whose parameters it is with respect to.
scores (dict[str, dict[str, Tensor]]) – The scores for each agent.
- Returns:
score_coefficient (dict[str, Tensor]) – A dictionary of the coefficients for the score term for each of the leader’s parameters.
pg_coefficient (dict[str, Tensor]) – A dictionary of the coefficients for the policy gradient term for each of the leader’s parameters.
- _get_actor_params() dict[str, dict[str, Parameter]] [source]#
Get the parameters for each agent as dictionaries with the parameter names.
- Returns:
actor_params (dict[str, dict[str, Parameter]]) – A dictionary whose keys are the agent names and whose values are dictionaries where the keys are the parameter names and the values are the parameters.
- _get_advantage(tensordict: TensorDictBase) Tensor [source]#
Get the advantage for a tensordict, normalising it if required.
- Parameters:
tensordict (TensorDictBase) – The input TensorDict.
- Returns:
advantage (torch.Tensor) – The normalised advantage.
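Presumably this is the usual mean/standard-deviation rescaling; a minimal sketch, assuming that is what normalize_advantage toggles:

```python
import torch


def normalise_advantage(advantage: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Rescale to zero mean and unit standard deviation; eps guards against
    # division by zero for (near-)constant advantages.
    return (advantage - advantage.mean()) / (advantage.std() + eps)
```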
- _get_and_zero_all_grads() dict[str, dict[str, Tensor]] [source]#
Get and zero the gradients for the parameters of all the agents.
- Returns:
grads (dict[str, dict[str, Tensor]]) – A dictionary where the keys are agent names and the values are dictionaries where the keys are the parameter names and the values are the gradients.
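A sketch of the general pattern (the actors argument and its structure are assumptions for illustration):

```python
import torch
from torch import nn


def get_and_zero_grads(actors: dict[str, nn.Module]) -> dict[str, dict[str, torch.Tensor]]:
    """Clone each agent's current gradients, then zero them in place so that
    subsequent backward passes start from a clean slate. Illustrative only."""
    grads: dict[str, dict[str, torch.Tensor]] = {}
    for agent_name, actor in actors.items():
        grads[agent_name] = {}
        for param_name, param in actor.named_parameters():
            if param.grad is None:
                grads[agent_name][param_name] = torch.zeros_like(param)
            else:
                grads[agent_name][param_name] = param.grad.clone()
                param.grad.zero_()
    return grads
```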
- _get_followers() tuple[dict[str, tuple[str, ...]], dict[str, tuple[str, ...]]] [source]#
Get dictionaries of the followers of each agent.
For each agent in the Stackelberg sequence, we get the agents in the group immediately following them, as well as all the agents in the groups following them. This is returned as two dictionaries.
- Returns:
immediate_followers (dict[str, tuple[str, …]]) – A dictionary where the keys are agent names and the values are tuples of agent names for the immediate followers of each agent.
descendent_followers (dict[str, tuple[str, …]]) – A dictionary where the keys are agent names and the values are tuples of agent names for all the followers of each agent (i.e. the immediate followers, as well as all the followers of the immediate followers, and so on).
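The derivation from the Stackelberg sequence might look roughly like the following sketch (the agent names in the example are hypothetical):

```python
def compute_followers(stackelberg_sequence: list[tuple[str, ...]]):
    """Derive immediate and descendent followers from a sequence of agent groups,
    e.g. [("prover",), ("verifier_0", "verifier_1")]. Illustrative only."""
    immediate_followers: dict[str, tuple[str, ...]] = {}
    descendent_followers: dict[str, tuple[str, ...]] = {}
    for i, group in enumerate(stackelberg_sequence):
        next_group = stackelberg_sequence[i + 1] if i + 1 < len(stackelberg_sequence) else ()
        later_agents = tuple(
            name for later_group in stackelberg_sequence[i + 1:] for name in later_group
        )
        for agent_name in group:
            immediate_followers[agent_name] = tuple(next_group)
            descendent_followers[agent_name] = later_agents
    return immediate_followers, descendent_followers
```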
- _log_weight(sample: TensorDictBase) tuple[Tensor, Tensor, Distribution] [source]#
Compute the log weight for the given TensorDict sample.
- Parameters:
sample (TensorDictBase) – The sample TensorDict.
- Returns:
log_prob (torch.Tensor) – The log probabilities of the sample.
log_weight (torch.Tensor) – The log weight of the sample.
dist (torch.distributions.Distribution) – The distribution used to compute the log weight.
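This mirrors the standard PPO-style importance weight. A schematic version, assuming the behaviour policy's log-probability is stored alongside the sample:

```python
import torch
from torch.distributions import Distribution


def compute_log_weight(dist: Distribution, action: torch.Tensor, prev_log_prob: torch.Tensor):
    # Log importance weight between the current policy and the behaviour policy
    # that generated the sample, as used in PPO-style objectives.
    log_prob = dist.log_prob(action)
    log_weight = log_prob - prev_log_prob
    return log_prob, log_weight, dist
```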
- _loss_critic(tensordict: TensorDictBase) Tensor [source]#
Get the critic loss without the clip fraction.
TorchRL’s loss_critic method returns a tuple with the critic loss and the clip fraction. This method returns only the critic loss.
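A sketch of what the override amounts to, where loss_module stands in for the loss instance itself:

```python
def critic_loss_only(loss_module, tensordict):
    # TorchRL's loss_critic returns (critic_loss, clip_fraction); keep only the loss.
    critic_loss, _clip_fraction = loss_module.loss_critic(tensordict)
    return critic_loss
```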
- _set_entropy_and_critic_losses(tensordict: TensorDictBase, td_out: TensorDictBase, dist: CompositeCategoricalDistribution)[source]#
Set the entropy and critic losses in the output TensorDict.
- Parameters:
tensordict (TensorDictBase) – The input TensorDict.
td_out (TensorDictBase) – The output TensorDict, which will be modified in place.
dist (CompositeCategoricalDistribution) – The distribution used to compute the log weight.
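Schematically, the entropy bonus enters the loss with a negative sign (entropy is maximised) and the critic loss is written alongside it; the output keys below follow the usual TorchRL PPO naming but are assumptions here:

```python
def set_entropy_and_critic_losses(td_out, entropy, critic_loss, entropy_coef):
    # td_out is a TensorDict modified in place; entropy and critic_loss are tensors.
    td_out.set("entropy", entropy.detach().mean())
    td_out.set("loss_entropy", -entropy_coef * entropy.mean())
    td_out.set("loss_critic", critic_loss.mean())
```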
- _set_ess(num_batch_dims: int, td_out: TensorDictBase, log_weight: Tensor)[source]#
Set the ESS in the output TensorDict, for logging.
- Parameters:
num_batch_dims (int) – The number of batch dimensions.
td_out (TensorDictBase) – The output TensorDict, which will be modified in place.
log_weight (Tensor) – The log weights.
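For reference, the effective sample size of a set of importance weights is typically computed as (sum of weights)^2 / (sum of squared weights); a numerically stable log-space sketch:

```python
import torch


def effective_sample_size(log_weight: torch.Tensor, num_batch_dims: int = 1) -> torch.Tensor:
    # ESS = (sum_i w_i)^2 / sum_i w_i^2, evaluated in log space over the leading
    # batch dimensions for numerical stability.
    dims = tuple(range(num_batch_dims))
    return torch.exp(2 * log_weight.logsumexp(dims) - (2 * log_weight).logsumexp(dims))
```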
- backward(loss_vals: TensorDictBase)[source]#
Compute and assign the gradients of the loss for each agent.
- Parameters:
loss_vals (TensorDictBase) – The loss values.
- forward(tensordict: TensorDictBase) TensorDictBase [source]#
Compute the loss for the Stackelberg Policy Gradient algorithm.
- Parameters:
tensordict (TensorDictBase) – The input TensorDict.
- Returns:
td_out (TensorDictBase) – The output TensorDict containing the losses.
- set_keys(**kwargs)[source]#
Set the keys of the input TensorDict that are used by this loss.
The keyword argument ‘action’ is treated specially. This should be an iterable of action keys. These are not validated against the set of accepted keys for this class. Instead, each is added to the set of accepted keys.
All other keyword arguments should match self._AcceptedKeys.
- Parameters:
**kwargs – The keyword arguments to set.
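A hypothetical usage example (the key names are illustrative, not the library's defaults):

```python
def configure_keys(loss_module):
    # loss_module is assumed to be an SpgLoss instance. The action keys are added
    # to the accepted-key set rather than validated; the other keyword arguments
    # must match self._AcceptedKeys.
    loss_module.set_keys(
        action=[("agents", "prover", "action"), ("agents", "verifier", "action")],
        advantage="advantage",
    )
```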