Algorithms

All algorithms are derived from nnabla_rl.algorithm.Algorithm.

Note

The algorithm runs on the cpu by default, regardless of any nnabla context set prior to instantiation. To run the algorithm on a gpu, set gpu_id through the algorithm’s config. Note that the algorithm overrides the nnabla context when training starts.
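As a minimal sketch of the note above, the helper below selects the device by passing gpu_id through the config. It assumes nnabla_rl is installed (with a CUDA-enabled nnabla when gpu_id is 0 or larger). build_a2c is a hypothetical helper name, and A2C, documented later on this page, is used only as an example. The import happens inside the function so that the snippet loads without the library.

```python
def build_a2c(env_or_env_info, gpu_id=-1):
    """Build an A2C instance; a negative gpu_id keeps training on the cpu."""
    # Imported lazily: this line requires nnabla_rl to be installed.
    from nnabla_rl.algorithms.a2c import A2C, A2CConfig

    # The device is chosen through the config, not through the nnabla context.
    config = A2CConfig(gpu_id=gpu_id)
    return A2C(env_or_env_info, config=config)
```

Calling build_a2c(env, gpu_id=0) would request the first gpu, while the default keeps everything on the cpu.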

Algorithm

class nnabla_rl.algorithm.AlgorithmConfig(gpu_id: int = -1)[source]¶

List of configurations common to all algorithms.

Parameters:: gpu_id (int) – id of the gpu to use. If negative, the training will run on cpu. Defaults to -1.

class nnabla_rl.algorithm.Algorithm(env_info, config=AlgorithmConfig(gpu_id=-1))[source]¶

Base Algorithm class.

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
config (AlgorithmConfig) – configuration of the algorithm

Note

Default functions, solvers and configurations follow each algorithm’s original paper. The default functions may not work well, depending on the environment.

abstract compute_eval_action(state, *, begin_of_episode=False, extra_info={}) → ndarray[source]¶

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters:

state (np.ndarray) – state to compute the action.
begin_of_episode (bool) – Used for resetting the rnn state. This flag indicates the beginning of an episode.
extra_info (Dict[str, Any]) – Dictionary to provide extra information to compute the action. Most algorithms do not use this field.

Returns:

Action for given state using current trained policy.

Return type:

np.ndarray
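The following sketch shows how compute_eval_action might be used to roll out one evaluation episode. It relies only on the signature documented above. The environment interface is an assumption: reset() is taken to return the initial state and step() the classic gym 4-tuple (state, reward, done, info). run_evaluation_episode is a hypothetical helper name.

```python
def run_evaluation_episode(algorithm, env, max_steps=1000):
    """Run one episode with the evaluation policy and return the total reward."""
    state = env.reset()  # assumption: reset() returns only the initial state
    total_reward = 0.0
    begin_of_episode = True
    for _ in range(max_steps):
        action = algorithm.compute_eval_action(state, begin_of_episode=begin_of_episode)
        # Only the first call of an episode should trigger an rnn state reset.
        begin_of_episode = False
        # assumption: classic gym step() returning a 4-tuple
        state, reward, done, _info = env.step(action)
        total_reward += float(reward)
        if done:
            break
    return total_reward
```

Newer gym releases return different tuples from reset() and step(), so the two marked lines would need adjusting there.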

compute_trajectory(initial_trajectory: Sequence[Tuple[ndarray, ndarray | None]]) → Tuple[Sequence[Tuple[ndarray, ndarray | None]], Sequence[Dict[str, Any]]][source]¶

Compute trajectory (sequence of state and action tuples) from the given initial trajectory using the current policy. Most reinforcement learning algorithms do not implement this method. Only the optimal control algorithms implement it.

Parameters:: initial_trajectory (Sequence[Tuple[np.ndarray, Optional[np.ndarray]]]) – initial trajectory.
Returns:: Sequence of state and action tuples and extra information (if any) at each timestep, computed with the current best policy. Extra information depends on the algorithm. The sequence length is the same as the length of the initial trajectory.
Return type:: Tuple[Sequence[Tuple[np.ndarray, Optional[np.ndarray]]], Sequence[Dict[str, Any]]]

classmethod is_rnn_supported() → bool[source]¶

Check whether the algorithm supports rnn models or not.

Returns:: True if the algorithm supports rnn models. Otherwise False.
Return type:: bool

abstract classmethod is_supported_env(env_or_env_info: Env | EnvironmentInfo) → bool[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

property iteration_num: int¶

Current iteration number.

Returns:: Current iteration number of running training.
Return type:: int

property latest_iteration_state: Dict[str, Any]¶

Return the latest iteration state, which is composed of items of the training process state. You can use this state for debugging (e.g. plotting a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns:: Dictionary with items of training process state.
Return type:: Dict[str, Any]

set_hooks(hooks: Sequence[Hook])[source]¶

Set hooks for running additional operations during training. Previously set hooks will be removed and replaced with the new hooks.

Parameters:: hooks (list of nnabla_rl.hook.Hook) – Hooks to invoke during training

train(env_or_buffer: Env | ReplayBuffer, total_iterations: int = 9223372036854775807)[source]¶

Train the policy with the reinforcement learning algorithm.

Parameters:

env_or_buffer (Union[gym.Env, ReplayBuffer]) – Target environment to train the policy online or replay buffer to train the policy offline.
total_iterations (int) – Total number of iterations to train the policy.

Raises:

UnsupportedTrainingException – Raised if this algorithm does not support the training method for the given parameter.

train_offline(replay_buffer: ReplayBuffer, total_iterations: int = 9223372036854775807)[source]¶

Train the policy using only the replay buffer.

Parameters:

replay_buffer (ReplayBuffer) – Replay buffer to sample experiences to train the policy.
total_iterations (int) – Total number of iterations to train the policy.

Raises:

UnsupportedTrainingException – Raised if the algorithm does not support offline training.

train_online(train_env: Env, total_iterations: int = 9223372036854775807)[source]¶

Train the policy by interacting with the given environment.

Parameters:

train_env (gym.Env) – Target environment to train the policy.
total_iterations (int) – Total number of iterations to train the policy.

Raises:

UnsupportedTrainingException – Raised if the algorithm does not support online training.
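The relationship between train, train_online and train_offline described above can be pictured with the toy classes below. This is one plausible reading of the documented behaviour, not the library's implementation: every class here is a stand-in defined for illustration, including the exception and the replay buffer.

```python
class UnsupportedTrainingException(Exception):
    """Stand-in for the exception named in the documentation."""


class ReplayBuffer:
    """Stand-in replay buffer; only used for the isinstance check below."""


class OnlineOnlyAlgorithm:
    """Toy algorithm that supports online training only."""

    def train(self, env_or_buffer, total_iterations=10):
        # A replay buffer selects offline training; anything else is treated
        # as an environment and trained online.
        if isinstance(env_or_buffer, ReplayBuffer):
            return self.train_offline(env_or_buffer, total_iterations)
        return self.train_online(env_or_buffer, total_iterations)

    def train_online(self, train_env, total_iterations=10):
        return f"online for {total_iterations} iterations"

    def train_offline(self, replay_buffer, total_iterations=10):
        raise UnsupportedTrainingException("offline training is not supported")
```

With this reading, passing a replay buffer to an online-only algorithm fails with UnsupportedTrainingException instead of training silently on the wrong kind of input.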

A2C

class nnabla_rl.algorithms.a2c.A2CConfig(gpu_id: int = -1, gamma: float = 0.99, n_steps: int = 5, learning_rate: float = 0.0007, entropy_coefficient: float = 0.01, value_coefficient: float = 0.5, decay: float = 0.99, epsilon: float = 1e-05, start_timesteps: int = 1, actor_num: int = 8, timelimit_as_terminal: bool = False, max_grad_norm: float | None = 0.5, seed: int = -1, learning_rate_decay_iterations: int = 50000000)[source]¶

Bases: AlgorithmConfig

List of configurations for A2C algorithm.

Parameters:

gamma (float) – discount factor of rewards. Defaults to 0.99.
n_steps (int) – number of rollout steps. Defaults to 5.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.0007.
entropy_coefficient (float) – scalar of entropy regularization term. Defaults to 0.01.
value_coefficient (float) – scalar of value loss. Defaults to 0.5.
decay (float) – decay parameter of Adam solver. Defaults to 0.99.
epsilon (float) – epsilon of Adam solver. Defaults to 0.00001.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 1.
actor_num (int) – number of parallel actors. Defaults to 8.
timelimit_as_terminal (bool) – Treat as done if the environment reaches the timelimit. Defaults to False.
max_grad_norm (Optional[float]) – threshold value for clipping gradient. Defaults to 0.5.
seed (int) – base seed of random number generator used by the actors. Defaults to -1.
learning_rate_decay_iterations (int) – learning rate will be decreased linearly to 0 until this iteration number. If 0 or negative, learning rate will be kept fixed. Defaults to 50000000.
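The effect of learning_rate_decay_iterations can be sketched as a small function. This is an illustration of the rule stated above (linear decrease to 0, fixed rate when the value is 0 or negative), not the library's code. The function name and the clamping at 0 after the final iteration are assumptions.

```python
def linearly_decayed_learning_rate(initial_learning_rate, iteration, decay_iterations):
    """Return the learning rate at the given iteration under a linear decay to 0."""
    if decay_iterations <= 0:
        # 0 or negative: the learning rate is kept fixed.
        return initial_learning_rate
    # Fraction of the schedule that is still ahead, clamped at 0 (assumption).
    remaining = max(0.0, 1.0 - iteration / decay_iterations)
    return initial_learning_rate * remaining
```

With the defaults above, the rate would start at 0.0007 and reach half of that value after 25000000 iterations.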

class nnabla_rl.algorithms.a2c.A2C(env_or_env_info, v_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.v_function.VFunction] = <nnabla_rl.algorithms.a2c.DefaultVFunctionBuilder object>, v_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.a2c.DefaultSolverBuilder object>, policy_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.a2c.DefaultPolicyBuilder object>, policy_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.a2c.DefaultSolverBuilder object>, config=A2CConfig(gpu_id=-1, gamma=0.99, n_steps=5, learning_rate=0.0007, entropy_coefficient=0.01, value_coefficient=0.5, decay=0.99, epsilon=1e-05, start_timesteps=1, actor_num=8, timelimit_as_terminal=False, max_grad_norm=0.5, seed=-1, learning_rate_decay_iterations=50000000))[source]¶

Bases: Algorithm

Advantage Actor-Critic (A2C) algorithm implementation.

This class implements the Advantage Actor-Critic (A2C) algorithm. A2C is the synchronous version of A3C, Asynchronous Advantage Actor-Critic. A3C was proposed by V. Mnih, et al. in the paper: “Asynchronous Methods for Deep Reinforcement Learning”. For details, see: https://arxiv.org/abs/1602.01783

This algorithm only supports online training.

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
v_function_builder (ModelBuilder[VFunction]) – builder of v function models
v_solver_builder (SolverBuilder) – builder for v function solvers
policy_builder (ModelBuilder[StochasticPolicy]) – builder of policy models
policy_solver_builder (SolverBuilder) – builder for policy solvers
config (A2CConfig) – configuration of A2C algorithm
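To make the roles of gamma and n_steps concrete, here is a minimal sketch of the bootstrapped n-step return and the resulting advantage, as commonly used in advantage actor-critic methods. It is an illustration under that common formulation, not the library's implementation. The function names and the plain-list data layout are assumptions.

```python
def n_step_returns(rewards, dones, bootstrap_value, gamma=0.99):
    """Compute discounted returns backwards over one rollout.

    rewards[t] and dones[t] belong to step t; bootstrap_value is the value
    estimate of the state reached after the last step.
    """
    returns = []
    running = bootstrap_value
    for reward, done in zip(reversed(rewards), reversed(dones)):
        # Do not bootstrap across an episode boundary.
        running = reward + gamma * running * (0.0 if done else 1.0)
        returns.append(running)
    returns.reverse()
    return returns


def advantages(returns, values):
    """Advantage estimate: n-step return minus the value prediction."""
    return [r - v for r, v in zip(returns, values)]
```

A rollout of n_steps rewards per actor would be passed in as rewards, with bootstrap_value taken from the v function at the final state.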

compute_eval_action(**kwargs)¶

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters:

state (np.ndarray) – state to compute the action.
begin_of_episode (bool) – Used for resetting the rnn state. This flag indicates the beginning of an episode.
extra_info (Dict[str, Any]) – Dictionary to provide extra information to compute the action. Most algorithms do not use this field.

Returns:

Action for given state using current trained policy.

Return type:

np.ndarray

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

property latest_iteration_state¶

Return the latest iteration state, which is composed of items of the training process state. You can use this state for debugging (e.g. plotting a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns:: Dictionary with items of training process state.
Return type:: Dict[str, Any]

AMP

class nnabla_rl.algorithms.amp.AMPConfig(gpu_id: int = -1, gamma: float = 0.95, lmb: float = 0.95, policy_learning_rate: float = 2e-06, policy_momentum: float = 0.9, policy_weight_decay: float = 0.0005, action_bound_loss_coefficient: float = 10.0, epsilon: float = 0.2, v_function_learning_rate: float = 0.0005, v_function_momentum: float = 0.9, normalized_advantage_clip: Tuple[float, float] = (-4.0, 4.0), value_at_task_fail: float = 0.0, value_at_task_success: float = 1.0, target_value_clip: Tuple[float, float] = (0.0, 1.0), epochs: int = 1, actor_num: int = 16, batch_size: int = 256, actor_timesteps: int = 4096, max_explore_steps: int = 200000000, final_explore_rate: float = 0.2, timelimit_as_terminal: bool = False, preprocess_state: bool = False, state_mean_initializer: Tuple[float | Tuple[float, ...], ...] | None = None, state_var_initializer: Tuple[float | Tuple[float, ...], ...] | None = None, num_processor_samples: int = 1000000, normalize_action: bool = False, action_mean: Tuple[float, ...] | None = None, action_var: Tuple[float, ...] | None = None, discriminator_learning_rate: float = 1e-05, discriminator_momentum: float = 0.9, discriminator_weight_decay: float = 0.0005, discriminator_extra_regularization_coefficient: float = 0.05, discriminator_extra_regularization_variable_names: Tuple[str] = ('logits/affine/W',), discriminator_gradient_penelty_coefficient: float = 10.0, discriminator_gradient_penalty_indexes: Tuple[int, ...] | None = (1,), discriminator_batch_size: int = 256, discriminator_epochs: int = 2, discriminator_reward_scale: float = 2.0, discriminator_agent_replay_buffer_size: int = 100000, use_reward_from_env: bool = False, lerp_reward_coefficient: float = 0.5, act_deterministic_in_eval: bool = True, seed: int = 1)[source]¶

Bases: AlgorithmConfig

List of configurations for Adversarial Motion Priors (AMP) algorithm.

Parameters:

gamma (float) – discount factor of rewards. Defaults to 0.95.
lmb (float) – scalar of lambda return’s computation in GAE. Defaults to 0.95.
policy_learning_rate (float) – learning rate which is set to policy solver. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.000002.
policy_momentum (float) – momentum which is set to policy solver. You can customize/override the momentum for each solver by implementing your own SolverBuilder. Defaults to 0.9.
policy_weight_decay (float) – coefficient for weight decay of policy function parameters. In AMP, weight decay is only applied to non-bias parameters. Defaults to 0.0005.
action_bound_loss_coefficient (float) – coefficient of action bound loss. Defaults to 10.0.
epsilon (float) – probability ratio clipping range of ppo style policy update. Defaults to 0.2.
v_function_learning_rate (float) – learning rate which is set to value function solver. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.0005.
v_function_momentum (float) – momentum which is set to value function solver. You can customize/override the momentum for each solver by implementing your own SolverBuilder. Defaults to 0.9.
normalized_advantage_clip (Tuple[float, float]) – clipping value for estimated advantages. This clipping is applied after normalization. Defaults to (-4.0, 4.0).
value_at_task_fail (float) – value for a task fail state. We overwrite the value of the state by this value when computing the value targets. Defaults to 0.0.
value_at_task_success (float) – value for a task success state. We overwrite the value of the state by this value when computing the value targets. Defaults to 1.0.
target_value_clip (Tuple[float, float]) – clipping value for estimated value targets. Defaults to (0.0, 1.0).
epochs (int) – number of epochs to perform in each training iteration for policy and value function. Defaults to 1.
actor_num (int) – number of parallel actors. Defaults to 16.
batch_size (int) – training batch size for policy and value function. Defaults to 256.
actor_timesteps (int) – number of timesteps to interact with the environment by the actors. Defaults to 4096.
max_explore_steps (int) – number of maximum environment exploring steps. Defaults to 200000000.
final_explore_rate (float) – final rate of the environment explorer. Defaults to 0.2.
timelimit_as_terminal (bool) – Treat as done if the environment reaches the timelimit. Defaults to False.
preprocess_state (bool) – enable preprocessing the states in the collected experiences before feeding as training batch. Defaults to False.
state_mean_initializer (Optional[Tuple[Union[float, Tuple[float, ...]], ...]]) – mean initializer value for the state preprocessor. Defaults to None.
state_var_initializer (Optional[Tuple[Union[float, Tuple[float, ...]], ...]]) – variance initializer value for the state preprocessor. Defaults to None.
num_processor_samples (int) – number of timesteps for updating the state preprocessor. Defaults to 1000000.
normalize_action (bool) – enable preprocessing the actions. Defaults to False.
action_mean (Optional[Tuple[float, ...]]) – mean for the action normalization. Defaults to None.
action_var (Optional[Tuple[float, ...]]) – variance for the action normalization. Defaults to None.
discriminator_learning_rate (float) – learning rate which is set to discriminator solver. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.00001.
discriminator_momentum (float) – momentum which is set to discriminator solver. You can customize/override the momentum for each solver by implementing your own SolverBuilder. Defaults to 0.9.
discriminator_weight_decay (float) – coefficient for weight decay of discriminator function parameters. In AMP, weight decay is only applied to non-bias parameters. Defaults to 0.0005.
discriminator_extra_regularization_coefficient (float) – coefficient value of extra regularization of discriminator function parameters that are defined in discriminator_extra_regularization_variable_names. Defaults to 0.05.
discriminator_extra_regularization_variable_names (Tuple[str]) – variable names for applying extra regularization. Defaults to ('logits/affine/W',).
discriminator_gradient_penelty_coefficient (float) – coefficient value of gradient penalty. See equation (8) in AMP paper. Defaults to 10.0.
discriminator_gradient_penalty_indexes (Optional[Tuple[int, ...]]) – state index number for applying gradient penalty. Defaults to (1,).
discriminator_batch_size (int) – training batch size for discriminator function. Defaults to 256.
discriminator_epochs (int) – number of epochs to perform in each training iteration for discriminator function. Defaults to 2.
discriminator_reward_scale (float) – reward scale. This value will multiply the output reward from the discriminator. Defaults to 2.0.
discriminator_agent_replay_buffer_size (int) – replay buffer size for discriminator training. Defaults to 100000.
use_reward_from_env (bool) – enable to use task reward (i.e., reward from the environment). Defaults to False.
lerp_reward_coefficient (float) – coefficient value for lerping the reward from the environment and the reward from the discriminator. Defaults to 0.5.
act_deterministic_in_eval (bool) – act deterministically during evaluation. Defaults to True.
seed (int) – base seed of random number generator used by the actors. Defaults to 1.
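Two of the parameters above, normalized_advantage_clip and epsilon, describe small numerical operations that are easy to show in isolation. The sketch below standardizes a batch of advantages before clipping it, and evaluates a ppo style clipped objective for a single sample. It is an illustration only: the function names, the use of the population standard deviation and the eps stabilizer are assumptions, not details taken from the library.

```python
import statistics


def normalize_and_clip(advantages, clip_range=(-4.0, 4.0), eps=1e-8):
    """Standardize advantages, then clip them to clip_range."""
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)
    low, high = clip_range
    # eps avoids a division by zero when all advantages are equal.
    return [min(max((a - mean) / (std + eps), low), high) for a in advantages]


def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """PPO-style objective for one sample; ratio = pi_new(a|s) / pi_old(a|s)."""
    clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    # Taking the minimum removes the incentive to move the ratio outside the range.
    return min(ratio * advantage, clipped_ratio * advantage)
```

Clipping after normalization bounds the influence of a few extreme advantage estimates on the policy update.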

class nnabla_rl.algorithms.amp.AMP(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.amp.AMPConfig = AMPConfig(gpu_id=-1, gamma=0.95, lmb=0.95, policy_learning_rate=2e-06, policy_momentum=0.9, policy_weight_decay=0.0005, action_bound_loss_coefficient=10.0, epsilon=0.2, v_function_learning_rate=0.0005, v_function_momentum=0.9, normalized_advantage_clip=(-4.0, 4.0), value_at_task_fail=0.0, value_at_task_success=1.0, target_value_clip=(0.0, 1.0), epochs=1, actor_num=16, batch_size=256, actor_timesteps=4096, max_explore_steps=200000000, final_explore_rate=0.2, timelimit_as_terminal=False, preprocess_state=False, state_mean_initializer=None, state_var_initializer=None, num_processor_samples=1000000, normalize_action=False, action_mean=None, action_var=None, discriminator_learning_rate=1e-05, discriminator_momentum=0.9, discriminator_weight_decay=0.0005, discriminator_extra_regularization_coefficient=0.05, discriminator_extra_regularization_variable_names=('logits/affine/W', ), discriminator_gradient_penelty_coefficient=10.0, discriminator_gradient_penalty_indexes=(1, ), discriminator_batch_size=256, discriminator_epochs=2, discriminator_reward_scale=2.0, discriminator_agent_replay_buffer_size=100000, use_reward_from_env=False, lerp_reward_coefficient=0.5, act_deterministic_in_eval=True, seed=1), v_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.v_function.VFunction] = <nnabla_rl.algorithms.amp.DefaultVFunctionBuilder object>, v_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.amp.DefaultVFunctionSolverBuilder object>, policy_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.amp.DefaultPolicyBuilder object>, policy_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.amp.DefaultPolicySolverBuilder object>, 
reward_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.reward_function.RewardFunction] = <nnabla_rl.algorithms.amp.DefaultRewardFunctionBuilder object>, reward_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.amp.DefaultRewardFunctionSolverBuilder object>, state_preprocessor_builder: ~nnabla_rl.builders.preprocessor_builder.PreprocessorBuilder | None = None, env_explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.amp.DefaultExplorerBuilder object>, discriminator_replay_buffer_builder: ~nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.amp.DefaultReplayBufferBuilder object>)[source]¶

Bases: Algorithm

Adversarial Motion Priors (AMP) implementation.

This class implements the Adversarial Motion Priors (AMP) algorithm proposed by Xue Bin Peng, et al. in the paper: “AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control”. For details, see: https://arxiv.org/abs/2104.02180

This algorithm only supports online training.

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info.
config (AMPConfig) – configuration of AMP algorithm.
v_function_builder (ModelBuilder[VFunction]) – builder of v function models.
v_solver_builder (SolverBuilder) – builder for v function solvers
policy_builder (ModelBuilder[StochasticPolicy]) – builder of policy models.
policy_solver_builder (SolverBuilder) – builder for policy solvers
reward_function_builder (ModelBuilder[RewardFunction]) – builder of reward function models.
reward_solver_builder (SolverBuilder) – builder for reward function solvers.
state_preprocessor_builder (None or PreprocessorBuilder) – state preprocessor builder to preprocess the states.
env_explorer_builder (ExplorerBuilder) – builder of environment explorer.
discriminator_replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer of discriminator.

compute_eval_action(**kwargs)¶

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters:

state (np.ndarray) – state to compute the action.
begin_of_episode (bool) – Used for resetting the rnn state. This flag indicates the beginning of an episode.
extra_info (Dict[str, Any]) – Dictionary to provide extra information to compute the action. Most algorithms do not use this field.

Returns:

Action for given state using current trained policy.

Return type:

np.ndarray

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

property latest_iteration_state¶

Return the latest iteration state, which is composed of items of the training process state. You can use this state for debugging (e.g. plotting a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns:: Dictionary with items of training process state.
Return type:: Dict[str, Any]

ATRPO

class nnabla_rl.algorithms.atrpo.ATRPOConfig(gpu_id: int = -1, lmb: float = 0.95, num_steps_per_iteration: int = 5000, pi_batch_size: int = 5000, sigma_kl_divergence_constraint: float = 0.01, maximum_backtrack_numbers: int = 10, backtrack_coefficient: float = 0.8, conjugate_gradient_damping: float = 0.01, conjugate_gradient_iterations: int = 10, vf_epochs: int = 5, vf_batch_size: int = 64, vf_learning_rate: float = 0.00030000000000000003, vf_l2_reg_coefficient: float = 0.003, preprocess_state: bool = True, gpu_batch_size: int | None = None, learning_rate_decay_iterations: int = 10000000)[source]¶

Bases: AlgorithmConfig

List of configurations for Average Reward TRPO algorithm.

Parameters:

lmb (float) – Scalar of lambda return’s computation in GAE. Defaults to 0.95. This configuration is related to bias and variance of estimated value. If it is close to 0, estimated value is low-variance but biased. If it is close to 1, estimated value is unbiased but high-variance.
num_steps_per_iteration (int) – Number of steps per training iteration for collecting on-policy experiences. Increasing this number gives more precise updates of the policy and value function parameters, but the computational time of each iteration will increase. Defaults to 5000.
pi_batch_size (int) – Training batch size of policy. Usually, pi_batch_size is the same as num_steps_per_iteration. Defaults to 5000.
sigma_kl_divergence_constraint (float) – Constraint size of kl divergence between previous policy and updated policy. Defaults to 0.01.
maximum_backtrack_numbers (int) – Maximum backtrack numbers of linesearch. Defaults to 10.
backtrack_coefficient (float) – Coefficient value of linesearch. Defaults to 0.8.
conjugate_gradient_damping (float) – Damping size of conjugate gradient method. Defaults to 0.01.
conjugate_gradient_iterations (int) – Number of iterations of conjugate gradient method. Defaults to 10.
vf_epochs (int) – Number of epochs in each iteration. Defaults to 5.
vf_batch_size (int) – Training batch size of value function. Defaults to 64.
vf_learning_rate (float) – Learning rate which is set to the solvers of value function. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 3 * 1e-4.
vf_l2_reg_coefficient (float) – L2 regularization coefficient for the network parameters. Defaults to 3 * 1e-3.
preprocess_state (bool) – Enable preprocessing the states in the collected experiences before feeding as training batch. Defaults to True.
gpu_batch_size (int, optional) – Actual batch size to reduce one forward gpu calculation memory. As long as gpu memory size is enough, this configuration should not be specified. If not specified, gpu_batch_size is the same as pi_batch_size. Defaults to None.
learning_rate_decay_iterations (int) – learning rate will be decreased linearly to 0 until this iteration number. If 0 or negative, learning rate will be kept fixed. Defaults to 10000000.
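The parameters maximum_backtrack_numbers and backtrack_coefficient control a backtracking line search. The sketch below shows the general technique under stated assumptions: the first candidate is the full step, each later candidate is scaled by one more factor of the coefficient, and evaluate is a hypothetical callback that returns True when a candidate is acceptable (for example, the kl divergence constraint holds and the objective improves). It is not the library's implementation.

```python
def backtracking_line_search(evaluate, full_step, max_backtracks=10, coefficient=0.8):
    """Shrink the step until evaluate(step) accepts it.

    Returns the accepted step, or None if every candidate was rejected.
    """
    for k in range(max_backtracks):
        # Candidate k is the full step scaled by coefficient ** k.
        step = [coefficient**k * s for s in full_step]
        if evaluate(step):
            return step
    return None
```

When every candidate is rejected, a caller would typically keep the previous policy parameters unchanged.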

class nnabla_rl.algorithms.atrpo.ATRPO(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.atrpo.ATRPOConfig = ATRPOConfig(gpu_id=-1, lmb=0.95, num_steps_per_iteration=5000, pi_batch_size=5000, sigma_kl_divergence_constraint=0.01, maximum_backtrack_numbers=10, backtrack_coefficient=0.8, conjugate_gradient_damping=0.01, conjugate_gradient_iterations=10, vf_epochs=5, vf_batch_size=64, vf_learning_rate=0.00030000000000000003, vf_l2_reg_coefficient=0.003, preprocess_state=True, gpu_batch_size=None, learning_rate_decay_iterations=10000000), v_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.v_function.VFunction] = <nnabla_rl.algorithms.atrpo.DefaultVFunctionBuilder object>, v_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.atrpo.DefaultSolverBuilder object>, policy_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.atrpo.DefaultPolicyBuilder object>, state_preprocessor_builder: ~nnabla_rl.builders.preprocessor_builder.PreprocessorBuilder | None = <nnabla_rl.algorithms.atrpo.DefaultPreprocessorBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.atrpo.DefaultExplorerBuilder object>)[source]¶

Bases: Algorithm

On-Policy Deep Reinforcement Learning for the Average-Reward Criterion implementation.

This class implements the Average Reward Trust Region Policy Optimization (ATRPO) with Generalized Advantage Estimation (GAE) algorithm proposed by Yiming Zhang, et al. and J. Schulman, et al. in the papers: “On-Policy Deep Reinforcement Learning for the Average-Reward Criterion” and “High-Dimensional Continuous Control Using Generalized Advantage Estimation”. For details, see: https://arxiv.org/abs/2106.07329 and https://arxiv.org/abs/1506.02438

This algorithm only supports online training.
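As a rough illustration of how GAE can be adapted to the average-reward criterion, the sketch below centres each reward by an estimate of the average reward and accumulates the temporal-difference errors with weight lmb, without a discount factor. The details are assumptions made for the example: the batch mean is used as the average-reward estimate, episode boundaries are ignored, and the function name is hypothetical. The library's computation may differ.

```python
def average_reward_gae(rewards, values, last_value, lmb=0.95):
    """GAE without discounting, using rewards centred by their batch mean.

    values[t] is the value estimate of state t and last_value is the estimate
    for the state reached after the final step.
    """
    average_reward = sum(rewards) / len(rewards)  # assumption: batch-mean estimate
    next_values = list(values[1:]) + [last_value]
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] - average_reward + next_values[t] - values[t]
        # lmb close to 0 keeps mostly the one-step error; close to 1 sums many errors.
        running = delta + lmb * running
        advantages[t] = running
    return advantages
```

Setting lmb to 0 reduces each advantage to its one-step error, the low-variance but biased end of the trade-off that lmb controls.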

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (ATRPOConfig) – configuration of ATRPO algorithm
v_function_builder (ModelBuilder[VFunction]) – builder of v function models
v_solver_builder (SolverBuilder) – builder for v function solvers
policy_builder (ModelBuilder[StochasticPolicy]) – builder of policy models
state_preprocessor_builder (None or PreprocessorBuilder) – state preprocessor builder to preprocess the states
explorer_builder (ExplorerBuilder) – builder of environment explorer

compute_eval_action(**kwargs)¶

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters:

state (np.ndarray) – state to compute the action.
begin_of_episode (bool) – Used for resetting the rnn state. This flag indicates the beginning of an episode.
extra_info (Dict[str, Any]) – Dictionary to provide extra information to compute the action. Most algorithms do not use this field.

Returns:

Action for given state using current trained policy.

Return type:

np.ndarray

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

property latest_iteration_state¶

Return the latest iteration state, which is composed of items of the training process state. You can use this state for debugging (e.g. plotting a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns:: Dictionary with items of training process state.
Return type:: Dict[str, Any]

BCQ

class nnabla_rl.algorithms.bcq.BCQConfig(gpu_id: int = -1, gamma: float = 0.99, learning_rate: float = 0.001, batch_size: int = 100, tau: float = 0.005, lmb: float = 0.75, phi: float = 0.05, num_q_ensembles: int = 2, num_action_samples: int = 10)[source]¶

Bases: AlgorithmConfig

List of configurations for BCQ algorithm.

Parameters:

gamma (float) – discount factor of reward. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.001.
batch_size (int) – training batch size. Defaults to 100.
tau (float) – target network’s parameter update coefficient. Defaults to 0.005.
lmb (float) – weight \(\lambda\) used for balancing the ratio between \(\min{Q}\) and \(\max{Q}\) on target q value generation (i.e. \(\lambda\min{Q} + (1 - \lambda)\max{Q}\)). Defaults to 0.75.
phi (float) – action perturbator noise coefficient. Defaults to 0.05.
num_q_ensembles (int) – number of q function ensembles. Defaults to 2.
num_action_samples (int) – number of actions to sample for computing target q values. Defaults to 10.
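The lmb and tau parameters above correspond to two short formulas. The sketch below evaluates the weighted combination of the minimum and maximum q values given for lmb, and a soft target update in its common form for tau. The function names and the plain-list parameter layout are assumptions for illustration. The maximization over sampled actions controlled by num_action_samples is not shown.

```python
def bcq_target_q(q_values, lmb=0.75):
    """Combine ensemble estimates: lmb * min(Q) + (1 - lmb) * max(Q)."""
    return lmb * min(q_values) + (1.0 - lmb) * max(q_values)


def soft_update(target_params, params, tau=0.005):
    """Move each target parameter a fraction tau towards the trained parameter."""
    return [(1.0 - tau) * t + tau * p for t, p in zip(target_params, params)]
```

With lmb set to 1 the target uses only the minimum over the ensemble, and with tau set to 1 the target network is copied outright.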

class nnabla_rl.algorithms.bcq.BCQ(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.bcq.BCQConfig = BCQConfig(gpu_id=-1, gamma=0.99, learning_rate=0.001, batch_size=100, tau=0.005, lmb=0.75, phi=0.05, num_q_ensembles=2, num_action_samples=10), q_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.bcq.DefaultQFunctionBuilder object>, q_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.bcq.DefaultSolverBuilder object>, vae_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.encoder.VariationalAutoEncoder] = <nnabla_rl.algorithms.bcq.DefaultVAEBuilder object>, vae_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.bcq.DefaultSolverBuilder object>, perturbator_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.perturbator.Perturbator] = <nnabla_rl.algorithms.bcq.DefaultPerturbatorBuilder object>, perturbator_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.bcq.DefaultSolverBuilder object>)[source]¶

Bases: Algorithm

Batch-Constrained Q-learning (BCQ) algorithm.

This class implements the Batch-Constrained Q-learning (BCQ) algorithm proposed by S. Fujimoto, et al. in the paper: “Off-Policy Deep Reinforcement Learning without Exploration”. For details, see: https://arxiv.org/abs/1812.02900

This algorithm only supports offline training.

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (BCQConfig) – configuration of the BCQ algorithm
q_function_builder (ModelBuilder[QFunction]) – builder of q-function models
q_solver_builder (SolverBuilder) – builder for q-function solvers
vae_builder (ModelBuilder[VariationalAutoEncoder]) – builder of variational auto encoder models
vae_solver_builder (SolverBuilder) – builder for variational auto encoder solvers
perturbator_builder (ModelBuilder[Perturbator]) – builder of perturbator models
perturbator_solver_builder (SolverBuilder) – builder for perturbator solvers

compute_eval_action(**kwargs)¶

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters:

state (np.ndarray) – state to compute the action.
begin_of_episode (bool) – Used for resetting the RNN state. This flag indicates the beginning of an episode.
extra_info (Dict[str, Any]) – Dictionary to provide extra information to compute the action. Most algorithms do not use this field.

Returns:

Action for given state using current trained policy.

Return type:

np.ndarray

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

property latest_iteration_state¶

Return the latest iteration state, which is composed of items describing the training process state. You can use this state for debugging (e.g. plotting a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns:: Dictionary with items of training process state.
Return type:: Dict[str, Any]

BEAR¶

class nnabla_rl.algorithms.bear.BEARConfig(gpu_id: int = -1, gamma: float = 0.99, learning_rate: float = 0.001, batch_size: int = 100, tau: float = 0.005, lmb: float = 0.75, epsilon: float = 0.05, num_q_ensembles: int = 2, num_mmd_actions: int = 5, num_action_samples: int = 10, mmd_type: str = 'gaussian', mmd_sigma: float = 20.0, initial_lagrange_multiplier: float | None = None, fix_lagrange_multiplier: bool = False, warmup_iterations: int = 20000, use_mean_for_eval: bool = False)[source]¶

Bases: AlgorithmConfig

List of configurations for BEAR algorithm.

Parameters:

gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.001.
batch_size (int) – training batch size. Defaults to 100.
tau (float) – target network’s parameter update coefficient. Defaults to 0.005.
lmb (float) – weight \(\lambda\) used for balancing the ratio between \(\min{Q}\) and \(\max{Q}\) on target q value generation (i.e. \(\lambda\min{Q} + (1 - \lambda)\max{Q}\)). Defaults to 0.75.
epsilon (float) – inequality constraint of dual gradient descent. Defaults to 0.05.
num_q_ensembles (int) – number of q ensembles. Defaults to 2.
num_mmd_actions (int) – number of actions to sample for computing maximum mean discrepancy (MMD). Defaults to 5.
num_action_samples (int) – number of actions to sample for computing target q values. Defaults to 10.
mmd_type (str) – kernel type used for MMD computation. laplacian or gaussian is supported. Defaults to gaussian.
mmd_sigma (float) – parameter used for adjusting the MMD. Defaults to 20.0.
initial_lagrange_multiplier (float, optional) – Initial value of the Lagrange multiplier. If not specified, a random value sampled from a normal distribution will be used instead.
fix_lagrange_multiplier (bool) – Whether to fix the Lagrange multiplier or not. Defaults to False.
warmup_iterations (int) – Number of iterations before policy updates start. Defaults to 20000.
use_mean_for_eval (bool) – Use mean value instead of best action among the samples for evaluation. Defaults to False.
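As a rough illustration of what mmd_type, mmd_sigma and num_mmd_actions control, the following sketch computes a (biased) squared MMD estimate between two small sets of actions. The function names and the exact kernel scaling are assumptions made for this example and may differ from the library's internal implementation.

```python
import math


def kernel(x, y, sigma=20.0, mmd_type="gaussian"):
    """Kernel between two action vectors (kernel scaling is an assumption)."""
    if mmd_type == "gaussian":
        dist = sum((a - b) ** 2 for a, b in zip(x, y))
        return math.exp(-dist / (2.0 * sigma))
    if mmd_type == "laplacian":
        dist = sum(abs(a - b) for a, b in zip(x, y))
        return math.exp(-dist / sigma)
    raise ValueError(f"unsupported mmd_type: {mmd_type}")


def mean_kernel(xs, ys, **kwargs):
    """Average kernel value over all pairs drawn from xs and ys."""
    return sum(kernel(x, y, **kwargs) for x in xs for y in ys) / (len(xs) * len(ys))


def squared_mmd(xs, ys, **kwargs):
    """Biased estimate of MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]."""
    return (mean_kernel(xs, xs, **kwargs)
            + mean_kernel(ys, ys, **kwargs)
            - 2.0 * mean_kernel(xs, ys, **kwargs))


dataset_actions = [[0.0, 0.0], [1.0, 0.0]]
policy_actions = [[5.0, 5.0], [6.0, 5.0]]
print(squared_mmd(dataset_actions, policy_actions, sigma=20.0, mmd_type="gaussian"))
```

The estimate is zero when both sets are identical and grows as the sampled policy actions move away from the dataset actions.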

class nnabla_rl.algorithms.bear.BEAR(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.bear.BEARConfig = BEARConfig(gpu_id=-1, gamma=0.99, learning_rate=0.001, batch_size=100, tau=0.005, lmb=0.75, epsilon=0.05, num_q_ensembles=2, num_mmd_actions=5, num_action_samples=10, mmd_type='gaussian', mmd_sigma=20.0, initial_lagrange_multiplier=None, fix_lagrange_multiplier=False, warmup_iterations=20000, use_mean_for_eval=False), q_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.bear.DefaultQFunctionBuilder object>, q_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.bear.DefaultSolverBuilder object>, pi_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.bear.DefaultPolicyBuilder object>, pi_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.bear.DefaultSolverBuilder object>, vae_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.encoder.VariationalAutoEncoder] = <nnabla_rl.algorithms.bear.DefaultVAEBuilder object>, vae_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.bear.DefaultSolverBuilder object>, lagrange_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.bear.DefaultSolverBuilder object>)[source]¶

Bases: Algorithm

Bootstrapping Error Accumulation Reduction (BEAR) algorithm.

This class implements the Bootstrapping Error Accumulation Reduction (BEAR) algorithm proposed by A. Kumar, et al. in the paper: “Stabilizing Off-Policy Q-learning via Bootstrapping Error Reduction”. For details see: https://arxiv.org/abs/1906.00949

This algorithm only supports offline training.

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (BEARConfig) – configuration of the BEAR algorithm
q_function_builder (ModelBuilder[QFunction]) – builder of q-function models
q_solver_builder (SolverBuilder) – builder for q-function solvers
pi_builder (ModelBuilder[StochasticPolicy]) – builder of policy models
pi_solver_builder (SolverBuilder) – builder for policy solvers
vae_builder (ModelBuilder[VariationalAutoEncoder]) – builder of variational auto encoder models
vae_solver_builder (SolverBuilder) – builder for variational auto encoder solvers
lagrange_solver_builder (SolverBuilder) – builder for lagrange multiplier solver

compute_eval_action(**kwargs)¶

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters:

state (np.ndarray) – state to compute the action.
begin_of_episode (bool) – Used for resetting the RNN state. This flag indicates the beginning of an episode.
extra_info (Dict[str, Any]) – Dictionary to provide extra information to compute the action. Most algorithms do not use this field.

Returns:

Action for given state using current trained policy.

Return type:

np.ndarray

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

property latest_iteration_state¶

Return the latest iteration state, which is composed of items describing the training process state. You can use this state for debugging (e.g. plotting a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns:: Dictionary with items of training process state.
Return type:: Dict[str, Any]

Categorical DDQN¶

class nnabla_rl.algorithms.categorical_ddqn.CategoricalDDQNConfig(gpu_id: int = -1, gamma: float = 0.99, learning_rate: float = 0.00025, batch_size: int = 32, num_steps: int = 1, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, learner_update_frequency: int = 4, target_update_frequency: int = 10000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.01, test_epsilon: float = 0.001, v_min: float = -10.0, v_max: float = 10.0, num_atoms: int = 51, loss_reduction_method: str = 'mean', unroll_steps: int = 1, burn_in_steps: int = 0, reset_rnn_on_terminal: bool = True)[source]¶

Bases: CategoricalDQNConfig

class nnabla_rl.algorithms.categorical_ddqn.CategoricalDDQN(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.categorical_dqn.CategoricalDQNConfig = CategoricalDDQNConfig(gpu_id=-1, gamma=0.99, learning_rate=0.00025, batch_size=32, num_steps=1, start_timesteps=50000, replay_buffer_size=1000000, learner_update_frequency=4, target_update_frequency=10000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.01, test_epsilon=0.001, v_min=-10.0, v_max=10.0, num_atoms=51, loss_reduction_method='mean', unroll_steps=1, burn_in_steps=0, reset_rnn_on_terminal=True), value_distribution_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.distributional_function.ValueDistributionFunction] = <nnabla_rl.algorithms.categorical_dqn.DefaultValueDistFunctionBuilder object>, value_distribution_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.categorical_dqn.DefaultSolverBuilder object>, replay_buffer_builder: ~nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.categorical_dqn.DefaultReplayBufferBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.categorical_dqn.DefaultExplorerBuilder object>)[source]¶

Bases: CategoricalDQN

Categorical Double DQN algorithm.

This class implements the Categorical Double DQN algorithm introduced by M. Hessel, et al. in the paper: “Rainbow: Combining Improvements in Deep Reinforcement Learning”. For details see: https://arxiv.org/abs/1710.02298. The difference between Categorical DQN and this algorithm is the update target of the q-value. This algorithm uses the following double DQN style q-value target for the categorical q-value update: \(r + \gamma Q_{\text{target}}(s_{t+1}, \arg\max_{a}{Q(s_{t+1}, a)})\).
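The double DQN style target can be sketched for scalar q-values as follows. The function name and the terminal handling are assumptions made for this example. In the categorical setting the greedy action would be chosen from the expected value of each action's value distribution, which is omitted here for brevity.

```python
def double_dqn_target(reward, gamma, q_online_next, q_target_next, terminal=False):
    """Compute r + gamma * Q_target(s', argmax_a Q(s', a)) for a single transition."""
    if terminal:
        # Assumption: no bootstrapping from the next state after a terminal transition.
        return reward
    # The online network selects the action ...
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ... and the target network evaluates it.
    return reward + gamma * q_target_next[best_action]


target = double_dqn_target(
    reward=1.0,
    gamma=0.99,
    q_online_next=[0.2, 0.9, 0.5],
    q_target_next=[1.0, 2.0, 3.0],
)
print(target)
```

Note that the target network's own maximum (3.0 for the third action) is not used: the action is selected by the online estimates and only evaluated by the target estimates.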

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (CategoricalDDQNConfig) – configuration of the CategoricalDDQN algorithm
value_distribution_builder (ModelBuilder[ValueDistributionFunction]) – builder of value distribution function models
value_distribution_solver_builder (SolverBuilder) – builder of value distribution function solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
explorer_builder (ExplorerBuilder) – builder of environment explorer

Categorical DQN¶

class nnabla_rl.algorithms.categorical_dqn.CategoricalDQNConfig(gpu_id: int = -1, gamma: float = 0.99, learning_rate: float = 0.00025, batch_size: int = 32, num_steps: int = 1, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, learner_update_frequency: int = 4, target_update_frequency: int = 10000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.01, test_epsilon: float = 0.001, v_min: float = -10.0, v_max: float = 10.0, num_atoms: int = 51, loss_reduction_method: str = 'mean', unroll_steps: int = 1, burn_in_steps: int = 0, reset_rnn_on_terminal: bool = True)[source]¶

Bases: AlgorithmConfig

List of configurations for CategoricalDQN algorithm.

Parameters:

gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.00025.
batch_size (int) – training batch size. Defaults to 32.
num_steps (int) – number of steps for N-step Q targets. Defaults to 1.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 50000.
replay_buffer_size (int) – the capacity of replay buffer. Defaults to 1000000.
learner_update_frequency (int) – the interval of learner update. Defaults to 4.
target_update_frequency (int) – the interval of target q-function update. Defaults to 10000.
max_explore_steps (int) – the number of steps decaying the epsilon value. The epsilon will be decayed linearly \(\epsilon=\epsilon_{init} - step\times\frac{\epsilon_{init} - \epsilon_{final}}{max\_explore\_steps}\). Defaults to 1000000.
initial_epsilon (float) – the initial epsilon value for ε-greedy explorer. Defaults to 1.0.
final_epsilon (float) – the last epsilon value for ε-greedy explorer. Defaults to 0.01.
test_epsilon (float) – the epsilon value on testing. Defaults to 0.001.
v_min (float) – lower limit of the value used in value distribution function. Defaults to -10.0.
v_max (float) – upper limit of the value used in value distribution function. Defaults to 10.0.
num_atoms (int) – the number of bins used in value distribution function. Defaults to 51.
loss_reduction_method (str) – KL loss reduction method. “sum” or “mean” is supported. Defaults to mean.
unroll_steps (int) – Number of steps to unroll the training network. The network will be unrolled even if the provided model doesn’t have RNN layers. Defaults to 1.
burn_in_steps (int) – Number of burn-in steps to initialize recurrent layer states during training. This flag does not take effect if given model is not an RNN model. Defaults to 0.
reset_rnn_on_terminal (bool) – Reset recurrent internal states to zero during training if episode ends. This flag does not take effect if given model is not an RNN model. Defaults to True.
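To make v_min, v_max and num_atoms concrete, the sketch below builds the evenly spaced support of the value distribution and computes the expected q-value of a distribution over it. The helper names are hypothetical and the even spacing follows the usual Categorical DQN formulation rather than a statement about the library's internals.

```python
def make_atoms(v_min=-10.0, v_max=10.0, num_atoms=51):
    """Return num_atoms evenly spaced values between v_min and v_max (inclusive)."""
    delta_z = (v_max - v_min) / (num_atoms - 1)
    return [v_min + i * delta_z for i in range(num_atoms)]


def expected_value(probabilities, atoms):
    """Expected q-value of a categorical distribution over the atoms."""
    return sum(p * z for p, z in zip(probabilities, atoms))


atoms = make_atoms()
uniform = [1.0 / len(atoms)] * len(atoms)
q_value = expected_value(uniform, atoms)
print(len(atoms), atoms[1] - atoms[0], q_value)
```

With the default settings the spacing between neighboring atoms is (10.0 - (-10.0)) / 50 = 0.4.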

class nnabla_rl.algorithms.categorical_dqn.CategoricalDQN(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.categorical_dqn.CategoricalDQNConfig = CategoricalDQNConfig(gpu_id=-1, gamma=0.99, learning_rate=0.00025, batch_size=32, num_steps=1, start_timesteps=50000, replay_buffer_size=1000000, learner_update_frequency=4, target_update_frequency=10000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.01, test_epsilon=0.001, v_min=-10.0, v_max=10.0, num_atoms=51, loss_reduction_method='mean', unroll_steps=1, burn_in_steps=0, reset_rnn_on_terminal=True), value_distribution_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.distributional_function.ValueDistributionFunction] = <nnabla_rl.algorithms.categorical_dqn.DefaultValueDistFunctionBuilder object>, value_distribution_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.categorical_dqn.DefaultSolverBuilder object>, replay_buffer_builder: ~nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.categorical_dqn.DefaultReplayBufferBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.categorical_dqn.DefaultExplorerBuilder object>)[source]¶

Bases: Algorithm

Categorical DQN algorithm.

This class implements the Categorical DQN algorithm proposed by M. Bellemare, et al. in the paper: “A Distributional Perspective on Reinforcement Learning”. For details see: https://arxiv.org/abs/1707.06887

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (CategoricalDQNConfig) – configuration of the CategoricalDQN algorithm
value_distribution_builder (ModelBuilder[ValueDistributionFunction]) – builder of value distribution function models
value_distribution_solver_builder (SolverBuilder) – builder of value distribution function solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
explorer_builder (ExplorerBuilder) – builder of environment explorer

compute_eval_action(**kwargs)¶

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters:

state (np.ndarray) – state to compute the action.
begin_of_episode (bool) – Used for resetting the RNN state. This flag indicates the beginning of an episode.
extra_info (Dict[str, Any]) – Dictionary to provide extra information to compute the action. Most algorithms do not use this field.

Returns:

Action for given state using current trained policy.

Return type:

np.ndarray

classmethod is_rnn_supported()[source]¶

Check whether the algorithm supports rnn models or not.

Returns:: True if the algorithm supports rnn models. Otherwise False.
Return type:: bool

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

property latest_iteration_state¶

Return the latest iteration state, which is composed of items describing the training process state. You can use this state for debugging (e.g. plotting a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns:: Dictionary with items of training process state.
Return type:: Dict[str, Any]

DDP¶

class nnabla_rl.algorithms.ddp.DDPConfig(gpu_id: int = -1, T_max: int = 50, num_iterations: int = 10, mu_min: float = 1e-06, modification_factor: float = 2.0, accept_improvement_ratio: float = 0.0)[source]¶

Bases: AlgorithmConfig

List of configurations for DDP (Differential Dynamic Programming) algorithm.

Parameters:

T_max (int) – Planning time step length. Defaults to 50.
num_iterations (int) – Number of iterations for the optimization. Defaults to 10.
mu_min (float) – Minimum value for regularizing the Hessian of the value function. Defaults to 1e-6.
modification_factor (float) – Modification factor for the regularizer. Defaults to 2.0.
accept_improvement_ratio (float) – Threshold value for deciding to accept the update or not. Defaults to 0.0.
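The interplay of mu_min and modification_factor can be sketched with the regularization schedule described by Y. Tassa et al. Whether this class follows that schedule exactly is an assumption; the function names and the initial values used below are hypothetical.

```python
def increase_mu(mu, delta, mu_min=1e-6, modification_factor=2.0):
    """Strengthen the regularization, e.g. after a failed backward pass."""
    delta = max(modification_factor, delta * modification_factor)
    mu = max(mu_min, mu * delta)
    return mu, delta


def decrease_mu(mu, delta, mu_min=1e-6, modification_factor=2.0):
    """Relax the regularization after a successful update."""
    delta = min(1.0 / modification_factor, delta / modification_factor)
    scaled = mu * delta
    # Regularization is switched off entirely once it would fall to mu_min or below.
    mu = scaled if scaled > mu_min else 0.0
    return mu, delta


mu, delta = 0.0, 1.0  # hypothetical initial values
mu, delta = increase_mu(mu, delta)
mu, delta = increase_mu(mu, delta)
print(mu, delta)
```

Repeated increases grow the regularizer geometrically, while repeated decreases shrink it until it is dropped to zero.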

class nnabla_rl.algorithms.ddp.DDP(env_or_env_info, dynamics: Dynamics, cost_function: CostFunction, config=DDPConfig(gpu_id=-1, T_max=50, num_iterations=10, mu_min=1e-06, modification_factor=2.0, accept_improvement_ratio=0.0))[source]¶

Bases: Algorithm

Differential Dynamic Programming algorithm. This class implements the differential dynamic programming (DDP) algorithm proposed by D. Mayne in the paper: “A Second-order Gradient Method for Determining Optimal Trajectories of Non-linear Discrete-time Systems”. We also referred to the paper by Y. Tassa et al.: “Synthesis and Stabilization of Complex Behaviors through Online Trajectory Optimization” for the implementation of this algorithm.

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
dynamics (Dynamics) – dynamics of the system to control
cost_function (CostFunction) – cost function to optimize the trajectory
config (DDPConfig) – the parameter for DDP controller

compute_eval_action(**kwargs)¶

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters:

state (np.ndarray) – state to compute the action.
begin_of_episode (bool) – Used for resetting the RNN state. This flag indicates the beginning of an episode.
extra_info (Dict[str, Any]) – Dictionary to provide extra information to compute the action. Most algorithms do not use this field.

Returns:

Action for given state using current trained policy.

Return type:

np.ndarray

compute_trajectory(**kwargs)¶

Compute trajectory (sequence of state and action tuples) from given initial trajectory using current policy. Most reinforcement learning algorithms do not implement this method. Only the optimal control algorithms implement it.

Parameters:: initial_trajectory (Sequence[Tuple[np.ndarray, Optional[np.ndarray]]]) – initial trajectory.
Returns:: Sequence of state and action tuples and extra information (if any) at each timestep, computed with the current best policy. Extra information depends on the algorithm. The sequence length is the same as the length of the initial trajectory.
Return type:: Tuple[Sequence[Tuple[np.ndarray, Optional[np.ndarray]]], Sequence[Dict[str, Any]]]

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

DDPG¶

class nnabla_rl.algorithms.ddpg.DDPGConfig(gpu_id: int = -1, gamma: float = 0.99, learning_rate: float = 0.001, batch_size: int = 100, tau: float = 0.005, start_timesteps: int = 10000, replay_buffer_size: int = 1000000, exploration_noise_sigma: float = 0.1, num_steps: int = 1, actor_unroll_steps: int = 1, actor_burn_in_steps: int = 0, actor_reset_rnn_on_terminal: bool = True, critic_unroll_steps: int = 1, critic_burn_in_steps: int = 0, critic_reset_rnn_on_terminal: bool = True)[source]¶

Bases: AlgorithmConfig

List of configurations for DDPG algorithm.

Parameters:

gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.001.
batch_size (int) – training batch size. Defaults to 100.
tau (float) – target network’s parameter update coefficient. Defaults to 0.005.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 10000.
replay_buffer_size (int) – capacity of the replay buffer. Defaults to 1000000.
exploration_noise_sigma (float) – standard deviation of gaussian exploration noise. Defaults to 0.1.
num_steps (int) – number of steps for N-step Q targets. Defaults to 1.
actor_unroll_steps (int) – Number of steps to unroll actor’s training network. The network will be unrolled even if the provided model doesn’t have RNN layers. Defaults to 1.
actor_burn_in_steps (int) – Number of burn-in steps to initialize actor’s recurrent layer states during training. This flag does not take effect if given model is not an RNN model. Defaults to 0.
actor_reset_rnn_on_terminal (bool) – Reset actor’s recurrent internal states to zero during training if episode ends. This flag does not take effect if given model is not an RNN model. Defaults to True.
critic_unroll_steps (int) – Number of steps to unroll critic’s training network. The network will be unrolled even if the provided model doesn’t have RNN layers. Defaults to 1.
critic_burn_in_steps (int) – Number of burn-in steps to initialize critic’s recurrent layer states during training. This flag does not take effect if given model is not an RNN model. Defaults to 0.
critic_reset_rnn_on_terminal (bool) – Reset critic’s recurrent internal states to zero during training if episode ends. This flag does not take effect if given model is not an RNN model. Defaults to True.
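The meaning of tau can be illustrated with a minimal sketch of a soft target update, assuming the common rule target = tau * online + (1 - tau) * target. The function name and the dictionary-of-floats parameter representation are hypothetical simplifications and not the library's API.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Move every target parameter a fraction tau toward its online counterpart."""
    return {
        name: tau * online_params[name] + (1.0 - tau) * value
        for name, value in target_params.items()
    }


target = {"w": 0.0, "b": 1.0}
online = {"w": 1.0, "b": 1.0}
updated = soft_update(target, online, tau=0.005)
print(updated)
```

A small tau such as the default 0.005 makes the target network trail the online network slowly, which is the usual motivation for this kind of update.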

class nnabla_rl.algorithms.ddpg.DDPG(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.ddpg.DDPGConfig = DDPGConfig(gpu_id=-1, gamma=0.99, learning_rate=0.001, batch_size=100, tau=0.005, start_timesteps=10000, replay_buffer_size=1000000, exploration_noise_sigma=0.1, num_steps=1, actor_unroll_steps=1, actor_burn_in_steps=0, actor_reset_rnn_on_terminal=True, critic_unroll_steps=1, critic_burn_in_steps=0, critic_reset_rnn_on_terminal=True), critic_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.ddpg.DefaultCriticBuilder object>, critic_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.ddpg.DefaultSolverBuilder object>, actor_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.policy.DeterministicPolicy] = <nnabla_rl.algorithms.ddpg.DefaultActorBuilder object>, actor_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.ddpg.DefaultSolverBuilder object>, replay_buffer_builder: ~nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.ddpg.DefaultReplayBufferBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.ddpg.DefaultExplorerBuilder object>)[source]¶

Bases: Algorithm

Deep Deterministic Policy Gradient (DDPG) algorithm.

This class implements a modified version of the Deep Deterministic Policy Gradient (DDPG) algorithm proposed by T. P. Lillicrap, et al. in the paper: “Continuous control with deep reinforcement learning” (https://arxiv.org/abs/1509.02971). We use Gaussian noise instead of an Ornstein-Uhlenbeck process to explore the environment. The effectiveness of using Gaussian noise for DDPG is reported in the paper: “Addressing Function Approximation Error in Actor-Critic Methods” (https://arxiv.org/abs/1802.09477).
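Gaussian exploration noise of the kind controlled by exploration_noise_sigma can be sketched as follows. The function name, the action bounds and the clipping step are assumptions made for this example; the library's explorer may handle bounds differently.

```python
import random


def noisy_action(action, sigma=0.1, low=-1.0, high=1.0, rng=random):
    """Add zero-mean Gaussian noise to each action dimension and clip to the bounds."""
    return [min(high, max(low, a + rng.gauss(0.0, sigma))) for a in action]


rng = random.Random(0)  # seeded generator so that the example is reproducible
explored = noisy_action([0.0, 0.95, -1.0], sigma=0.1, rng=rng)
print(explored)
```

Unlike an Ornstein-Uhlenbeck process, the noise samples here are independent across timesteps, so no noise state has to be carried between calls.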

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (DDPGConfig) – configuration of the DDPG algorithm
critic_builder (ModelBuilder[QFunction]) – builder of critic models
critic_solver_builder (SolverBuilder) – builder of critic solvers
actor_builder (ModelBuilder[DeterministicPolicy]) – builder of actor models
actor_solver_builder (SolverBuilder) – builder of actor solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
explorer_builder (ExplorerBuilder) – builder of environment explorer

compute_eval_action(**kwargs)¶

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters:

state (np.ndarray) – state to compute the action.
begin_of_episode (bool) – Used for resetting the RNN state. This flag indicates the beginning of an episode.
extra_info (Dict[str, Any]) – Dictionary to provide extra information to compute the action. Most algorithms do not use this field.

Returns:

Action for given state using current trained policy.

Return type:

np.ndarray

classmethod is_rnn_supported()[source]¶

Check whether the algorithm supports rnn models or not.

Returns:: True if the algorithm supports rnn models. Otherwise False.
Return type:: bool

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

property latest_iteration_state¶

Return the latest iteration state, which is composed of items describing the training process state. You can use this state for debugging (e.g. plotting a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns:: Dictionary with items of training process state.
Return type:: Dict[str, Any]

DDQN¶

class nnabla_rl.algorithms.ddqn.DDQNConfig(gpu_id: int = -1, gamma: float = 0.99, learning_rate: float = 0.00025, batch_size: int = 32, num_steps: int = 1, learner_update_frequency: float = 4, target_update_frequency: float = 10000, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.1, test_epsilon: float = 0.05, grad_clip: Tuple[float, float] | None = (-1.0, 1.0), unroll_steps: int = 1, burn_in_steps: int = 0, reset_rnn_on_terminal: bool = True)[source]¶

Bases: DQNConfig

List of configurations for Double DQN (DDQN) algorithm.

Parameters:

gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.00025.
batch_size (int) – training batch size. Defaults to 32.
num_steps (int) – number of steps for N-step Q targets. Defaults to 1.
learner_update_frequency (float) – the interval of learner update. Defaults to 4.
target_update_frequency (float) – the interval of target q-function update. Defaults to 10000.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 50000.
replay_buffer_size (int) – the capacity of replay buffer. Defaults to 1000000.
max_explore_steps (int) – the number of steps decaying the epsilon value. The epsilon will be decayed linearly \(\epsilon=\epsilon_{init} - step\times\frac{\epsilon_{init} - \epsilon_{final}}{max\_explore\_steps}\). Defaults to 1000000.
initial_epsilon (float) – the initial epsilon value for ε-greedy explorer. Defaults to 1.0.
final_epsilon (float) – the last epsilon value for ε-greedy explorer. Defaults to 0.1.
test_epsilon (float) – the epsilon value on testing. Defaults to 0.05.
grad_clip (Optional[Tuple[float, float]]) – Clip the gradient of final layer. Defaults to (-1.0, 1.0).
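The linear epsilon schedule given for max_explore_steps can be written out as a short sketch. The function name is hypothetical, and holding epsilon at final_epsilon once max_explore_steps is exceeded is an assumption of this example.

```python
def linear_epsilon(step, initial_epsilon=1.0, final_epsilon=0.1, max_explore_steps=1_000_000):
    """Linearly decay epsilon from initial_epsilon to final_epsilon over max_explore_steps."""
    decayed = initial_epsilon - step * (initial_epsilon - final_epsilon) / max_explore_steps
    # Assumption: epsilon stays at final_epsilon after the decay period.
    return max(final_epsilon, decayed)


for step in (0, 500_000, 1_000_000, 2_000_000):
    print(step, linear_epsilon(step))
```

With the DDQN defaults, epsilon is 1.0 at step 0, about 0.55 halfway through the decay, and 0.1 from step 1000000 onward.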

class nnabla_rl.algorithms.ddqn.DDQN(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.ddqn.DDQNConfig = DDQNConfig(gpu_id=-1, gamma=0.99, learning_rate=0.00025, batch_size=32, num_steps=1, learner_update_frequency=4, target_update_frequency=10000, start_timesteps=50000, replay_buffer_size=1000000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.1, test_epsilon=0.05, grad_clip=(-1.0, 1.0), unroll_steps=1, burn_in_steps=0, reset_rnn_on_terminal=True), q_func_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.dqn.DefaultQFunctionBuilder object>, q_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.dqn.DefaultSolverBuilder object>, replay_buffer_builder: ~nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.dqn.DefaultReplayBufferBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.dqn.DefaultExplorerBuilder object>)[source]¶

Bases: DQN

Double DQN algorithm.

This class implements the Deep Q-Network with double q-learning (DDQN) algorithm proposed by H. van Hasselt, et al. in the paper: “Deep Reinforcement Learning with Double Q-learning”. For details see: https://arxiv.org/abs/1509.06461

Note that the default solver used in this implementation is RMSPropGraves, as in the original paper. However, in practical applications, we recommend using Adam as the optimizer of DDQN. You can replace the solver by implementing a (SolverBuilder) and passing it when instantiating the DDQN class.

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (DDQNConfig) – the parameter for DDQN training
q_func_builder (ModelBuilder) – builder of q function model
q_solver_builder (SolverBuilder) – builder of q function solver
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
explorer_builder (ExplorerBuilder) – builder of environment explorer

DecisionTransformer¶

class nnabla_rl.algorithms.decision_transformer.DecisionTransformerConfig(gpu_id: int = -1, learning_rate: float = 0.0006, batch_size: int = 128, context_length: int = 30, max_timesteps: int | None = None, grad_clip_norm: float = 1.0, weight_decay: float = 0.1, target_return: int = 90, reward_scale: float = 1.0)[source]¶

Bases: AlgorithmConfig

List of configurations for DecisionTransformer algorithm.

Parameters:

learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.0006.
batch_size (int) – training batch size. Defaults to 128.
context_length (int) – Context length of transformer model. Defaults to 30.
max_timesteps (Optional[int]) – Optional. Maximum timestep of the training environment. If the value is not provided, the algorithm will guess the maximum episode length through EnvironmentInfo.
grad_clip_norm (float) – Gradient clipping threshold for default solver. Defaults to 1.0.
weight_decay (float) – Weight decay parameter for default solver. Defaults to 0.1.
target_return (int) – Initial target return used to compute the evaluation action. Defaults to 90.
reward_scale (float) – Reward scaler. Reward received during evaluation will be multiplied by this value. Defaults to 1.0.
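The interplay of target_return and reward_scale can be sketched as follows. This assumes the evaluation loop follows the procedure in the Decision Transformer paper, where the return-to-go fed to the model is decreased by each received (scaled) reward. The helper name returns_to_go is hypothetical and is not part of nnabla_rl.

```python
from typing import List


def returns_to_go(target_return: float, rewards: List[float],
                  reward_scale: float = 1.0) -> List[float]:
    """Return the return-to-go used as conditioning before each step."""
    remaining = float(target_return)
    conditioning = [remaining]
    for reward in rewards:
        # Assumption: the scaled reward is subtracted from the remaining target.
        remaining -= reward * reward_scale
        conditioning.append(remaining)
    return conditioning


# Start from the default target_return of 90 and observe three rewards.
print(returns_to_go(90, [10.0, 20.0, 5.0]))
```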

class nnabla_rl.algorithms.decision_transformer.DecisionTransformer(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.decision_transformer.DecisionTransformerConfig = DecisionTransformerConfig(gpu_id=-1, learning_rate=0.0006, batch_size=128, context_length=30, max_timesteps=None, grad_clip_norm=1.0, weight_decay=0.1, target_return=90, reward_scale=1.0), transformer_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.decision_transformer.StochasticDecisionTransformer | ~nnabla_rl.models.decision_transformer.DeterministicDecisionTransformer] = <nnabla_rl.algorithms.decision_transformer.DefaultTransformerBuilder object>, transformer_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.decision_transformer.DefaultSolverBuilder object>, transformer_wd_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder | None = None, lr_scheduler_builder: ~nnabla_rl.builders.lr_scheduler_builder.LearningRateSchedulerBuilder | None = None)[source]¶

Bases: Algorithm

DecisionTransformer algorithm.

This class implements the DecisionTransformer algorithm proposed by L. Chen, et al. in the paper “Decision Transformer: Reinforcement Learning via Sequence Modeling”. For details see: https://arxiv.org/abs/2106.01345

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (DecisionTransformerConfig) – the parameter for DecisionTransformer training
transformer_builder (ModelBuilder) – builder of transformer model
transformer_solver_builder (SolverBuilder) – builder of transformer solver

compute_eval_action(**kwargs)¶

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters:

state (np.ndarray) – state to compute the action.
begin_of_episode (bool) – Used for resetting the rnn state. This flag indicates the beginning of an episode.
extra_info (Dict[str, Any]) – Dictionary to provide extra information to compute the action. Most algorithms do not use this field.

Returns:

Action for given state using current trained policy.

Return type:

np.ndarray

classmethod is_rnn_supported()[source]¶

Check whether the algorithm supports rnn models or not.

Returns:: True if the algorithm supports rnn models. Otherwise False.
Return type:: bool

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

property latest_iteration_state¶

Return the latest iteration state, which is composed of items describing the state of the training process. You can use this state for debugging (e.g. plotting a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns:: Dictionary with items of training process state.
Return type:: Dict[str, Any]

DQN¶

class nnabla_rl.algorithms.dqn.DQNConfig(gpu_id: int = -1, gamma: float = 0.99, learning_rate: float = 0.00025, batch_size: int = 32, num_steps: int = 1, learner_update_frequency: float = 4, target_update_frequency: float = 10000, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.1, test_epsilon: float = 0.05, grad_clip: Tuple[float, float] | None = (-1.0, 1.0), unroll_steps: int = 1, burn_in_steps: int = 0, reset_rnn_on_terminal: bool = True)[source]¶

Bases: AlgorithmConfig

List of configurations for DQN algorithm.

Parameters:

gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.00025.
batch_size (int) – training batch size. Defaults to 32.
num_steps (int) – number of steps for N-step Q targets. Defaults to 1.
learner_update_frequency (int) – the interval of learner update. Defaults to 4.
target_update_frequency (int) – the interval of target q-function update. Defaults to 10000.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 50000.
replay_buffer_size (int) – the capacity of replay buffer. Defaults to 1000000.
max_explore_steps (int) – the number of steps decaying the epsilon value. The epsilon will be decayed linearly \(\epsilon=\epsilon_{init} - step\times\frac{\epsilon_{init} - \epsilon_{final}}{max\_explore\_steps}\). Defaults to 1000000.
initial_epsilon (float) – the initial epsilon value for ε-greedy explorer. Defaults to 1.0.
final_epsilon (float) – the last epsilon value for ε-greedy explorer. Defaults to 0.1.
test_epsilon (float) – the epsilon value on testing. Defaults to 0.05.
grad_clip (Optional[Tuple[float, float]]) – Clip the gradient of final layer. Defaults to (-1.0, 1.0).
unroll_steps (int) – Number of steps to unroll the training network. The network will be unrolled even if the provided model doesn’t have RNN layers. Defaults to 1.
burn_in_steps (int) – Number of burn-in steps to initialize recurrent layer states during training. This flag does not take effect if the given model is not an RNN model. Defaults to 0.
reset_rnn_on_terminal (bool) – Reset recurrent internal states to zero during training if the episode ends. This flag does not take effect if the given model is not an RNN model. Defaults to True.
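The linear decay formula given for max_explore_steps can be written out as a small function. This is a minimal sketch; the function name is hypothetical, and holding epsilon at final_epsilon after the decay has finished is an assumption that the formula itself does not state.

```python
def linear_decay_epsilon(step: int, initial_epsilon: float = 1.0,
                         final_epsilon: float = 0.1,
                         max_explore_steps: int = 1000000) -> float:
    """Epsilon at the given step under linear decay."""
    slope = (initial_epsilon - final_epsilon) / max_explore_steps
    epsilon = initial_epsilon - step * slope
    # Assumption: epsilon never drops below final_epsilon.
    return max(epsilon, final_epsilon)


for step in (0, 250000, 500000, 1000000, 2000000):
    print(step, round(linear_decay_epsilon(step), 4))
```

With the default values, epsilon is halfway between initial_epsilon and final_epsilon (0.55) after 500000 steps.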

class nnabla_rl.algorithms.dqn.DQN(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.dqn.DQNConfig = DQNConfig(gpu_id=-1, gamma=0.99, learning_rate=0.00025, batch_size=32, num_steps=1, learner_update_frequency=4, target_update_frequency=10000, start_timesteps=50000, replay_buffer_size=1000000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.1, test_epsilon=0.05, grad_clip=(-1.0, 1.0), unroll_steps=1, burn_in_steps=0, reset_rnn_on_terminal=True), q_func_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.dqn.DefaultQFunctionBuilder object>, q_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.dqn.DefaultSolverBuilder object>, replay_buffer_builder: ~nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.dqn.DefaultReplayBufferBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.dqn.DefaultExplorerBuilder object>)[source]¶

Bases: Algorithm

DQN algorithm.

This class implements the Deep Q-Network (DQN) algorithm proposed by V. Mnih, et al. in the paper “Human-level control through deep reinforcement learning”. For details see: https://www.nature.com/articles/nature14236

Note that the default solver used in this implementation is RMSPropGraves, as in the original paper. In practical applications, however, we recommend using Adam as the optimizer of DQN. You can replace the solver by implementing a SolverBuilder and passing it when instantiating the DQN class.

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (DQNConfig) – the parameter for DQN training
q_func_builder (ModelBuilder) – builder of q function model
q_solver_builder (SolverBuilder) – builder of q function solver
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
explorer_builder (ExplorerBuilder) – builder of environment explorer
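The num_steps setting of DQNConfig refers to N-step Q targets. The sketch below shows the usual form of such a target for a single sample, assuming the bootstrap value is the maximum Q-value of the state reached after N steps. The function name is hypothetical, and the library's internal computation may differ in detail.

```python
from typing import Sequence


def n_step_q_target(rewards: Sequence[float], gamma: float,
                    bootstrap_q_values: Sequence[float],
                    terminal: bool) -> float:
    """N-step target, where N is the number of rewards given."""
    target = 0.0
    for k, reward in enumerate(rewards):
        # Discounted sum of the N rewards collected along the way.
        target += (gamma ** k) * reward
    if not terminal:
        # Bootstrap from the state reached after N steps.
        target += (gamma ** len(rewards)) * max(bootstrap_q_values)
    return target


# num_steps=1 reduces to the ordinary one-step target.
print(n_step_q_target([1.0], 0.5, [4.0, 2.0], terminal=False))
# num_steps=2 uses two rewards before bootstrapping.
print(n_step_q_target([1.0, 1.0], 0.5, [4.0, 2.0], terminal=False))
```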

compute_eval_action(**kwargs)¶

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters:

state (np.ndarray) – state to compute the action.
begin_of_episode (bool) – Used for resetting the rnn state. This flag indicates the beginning of an episode.
extra_info (Dict[str, Any]) – Dictionary to provide extra information to compute the action. Most algorithms do not use this field.

Returns:

Action for given state using current trained policy.

Return type:

np.ndarray

classmethod is_rnn_supported()[source]¶

Check whether the algorithm supports rnn models or not.

Returns:: True if the algorithm supports rnn models. Otherwise False.
Return type:: bool

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

property latest_iteration_state¶

Return the latest iteration state, which is composed of items describing the state of the training process. You can use this state for debugging (e.g. plotting a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns:: Dictionary with items of training process state.
Return type:: Dict[str, Any]

DRQN¶

class nnabla_rl.algorithms.drqn.DRQNConfig(gpu_id: int = -1, gamma: float = 0.99, learning_rate: float = 0.1, batch_size: int = 32, num_steps: int = 1, learner_update_frequency: float = 4, target_update_frequency: float = 10000, start_timesteps: int = 50000, replay_buffer_size: int = 400000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.1, test_epsilon: float = 0.05, grad_clip: Tuple[float, float] | None = (-1.0, 1.0), unroll_steps: int = 10, burn_in_steps: int = 0, reset_rnn_on_terminal: bool = False, clip_grad_norm: float = 10.0)[source]¶

Bases: DQNConfig

List of configurations for DRQN algorithm. Most of the configs are inherited from DQNConfig.

Parameters:

clip_grad_norm (float) – Limits the gradient of the model parameters to at most this value on parameter updates. This value has no effect if you implement your own SolverBuilder. Defaults to 10.0.
learning_rate (float) – Solver learning rate. Value overridden from DQN. Defaults to 0.1.
replay_buffer_size (int) – Replay buffer size. Value overridden from DQN. Defaults to 400000.
unroll_steps (int) – Number of steps to unroll recurrent layer during training. Value overridden from DQN. Defaults to 10.
reset_rnn_on_terminal (bool) – Reset recurrent internal states to zero during training if episode ends. Value overridden from DQN. Defaults to False.
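One common reading of clip_grad_norm is clipping by L2 norm, sketched below for a single gradient vector. This is an assumption made for illustration: the text above does not say whether the limit is applied per parameter or to all parameters jointly, and the helper name is hypothetical.

```python
import math
from typing import List


def clip_by_norm(gradient: List[float], clip_grad_norm: float = 10.0) -> List[float]:
    """Rescale the gradient so that its L2 norm is at most clip_grad_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= clip_grad_norm:
        # Small gradients pass through unchanged.
        return list(gradient)
    scale = clip_grad_norm / norm
    return [g * scale for g in gradient]


print(clip_by_norm([3.0, 4.0]))    # norm 5, below the limit
print(clip_by_norm([30.0, 40.0]))  # norm 50, rescaled to norm 10
```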

class nnabla_rl.algorithms.drqn.DRQN(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.drqn.DRQNConfig = DRQNConfig(gpu_id=-1, gamma=0.99, learning_rate=0.1, batch_size=32, num_steps=1, learner_update_frequency=4, target_update_frequency=10000, start_timesteps=50000, replay_buffer_size=400000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.1, test_epsilon=0.05, grad_clip=(-1.0, 1.0), unroll_steps=10, burn_in_steps=0, reset_rnn_on_terminal=False, clip_grad_norm=10.0), q_func_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.drqn.DefaultQFunctionBuilder object>, q_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.drqn.DefaultSolverBuilder object>, replay_buffer_builder: ~nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.dqn.DefaultReplayBufferBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.dqn.DefaultExplorerBuilder object>)[source]¶

Bases: DQN

DRQN algorithm.

This class implements the Bootstrapped random update version of the Deep Recurrent Q-Network (DRQN) algorithm proposed by M. Hausknecht, et al. in the paper “Deep Recurrent Q-Learning for Partially Observable MDPs”. For details see: https://arxiv.org/pdf/1507.06527.pdf

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (DRQNConfig) – the parameter for DRQN training
q_func_builder (ModelBuilder) – builder of q function model
q_solver_builder (SolverBuilder) – builder of q function solver
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
explorer_builder (ExplorerBuilder) – builder of environment explorer
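In the bootstrapped random update scheme described in the DRQN paper, each update starts at a random point of an episode and proceeds for a fixed number of steps, with the recurrent state zeroed at the start. The sketch below only illustrates the window selection; the function name is hypothetical, and how nnabla_rl samples sequences internally is not shown here.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_unroll_window(episode: Sequence[T], unroll_steps: int,
                         rng: random.Random) -> List[T]:
    """Pick a random start point and return unroll_steps consecutive items."""
    if len(episode) < unroll_steps:
        raise ValueError("episode is shorter than unroll_steps")
    # Any start index that leaves room for a full window is allowed.
    start = rng.randrange(len(episode) - unroll_steps + 1)
    return list(episode[start:start + unroll_steps])


rng = random.Random(0)
episode = list(range(25))  # stand-in for 25 stored transitions
window = sample_unroll_window(episode, unroll_steps=10, rng=rng)
print(len(window), window[0], window[-1])
```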

GAIL¶

class nnabla_rl.algorithms.gail.GAILConfig(gpu_id: int = -1, preprocess_state: bool = True, act_deterministic_in_eval: bool = True, discriminator_batch_size: int = 50000, discriminator_learning_rate: float = 0.01, discriminator_update_frequency: int = 1, adversary_entropy_coef: float = 0.001, policy_update_frequency: int = 1, gamma: float = 0.995, lmb: float = 0.97, pi_batch_size: int = 50000, num_steps_per_iteration: int = 50000, sigma_kl_divergence_constraint: float = 0.01, maximum_backtrack_numbers: int = 10, conjugate_gradient_damping: float = 0.1, conjugate_gradient_iterations: int = 10, vf_epochs: int = 5, vf_batch_size: int = 128, vf_learning_rate: float = 0.001)[source]¶

Bases: AlgorithmConfig

List of configurations for GAIL algorithm.

Parameters:

act_deterministic_in_eval (bool) – Act deterministically during evaluation. Defaults to True.
discriminator_batch_size (int) – Training batch size of the discriminator. Usually, discriminator_batch_size is the same as pi_batch_size. Defaults to 50000.
discriminator_learning_rate (float) – Learning rate which is set to the solvers of the discriminator function. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.01.
discriminator_update_frequency (int) – Frequency (measured in the number of parameter updates) of discriminator update. Defaults to 1.
adversary_entropy_coef (float) – Coefficient of entropy loss in discriminator training. Defaults to 0.001.
policy_update_frequency (int) – Frequency (measured in the number of parameter updates) of policy update. Defaults to 1.
gamma (float) – Discount factor of rewards. Defaults to 0.995.
lmb (float) – Scalar of lambda return’s computation in GAE. Defaults to 0.97. This configuration is related to bias and variance of estimated value. If it is close to 0, estimated value is low-variance but biased. If it is close to 1, estimated value is unbiased but high-variance.
num_steps_per_iteration (int) – Number of steps per training iteration for collecting on-policy experiences. Increasing this value makes the policy and value function updates more precise, but the computational time of each iteration will increase. Defaults to 50000.
pi_batch_size (int) – Training batch size of policy. Usually, pi_batch_size is the same as num_steps_per_iteration. Defaults to 50000.
sigma_kl_divergence_constraint (float) – Constraint size of kl divergence between previous policy and updated policy. Defaults to 0.01.
maximum_backtrack_numbers (int) – Maximum backtrack numbers of linesearch. Defaults to 10.
conjugate_gradient_damping (float) – Damping size of conjugate gradient method. Defaults to 0.1.
conjugate_gradient_iterations (int) – Number of iterations of conjugate gradient method. Defaults to 10.
vf_epochs (int) – Number of epochs in each iteration. Defaults to 5.
vf_batch_size (int) – Training batch size of value function. Defaults to 128.
vf_learning_rate (float) – Learning rate which is set to the solvers of value function. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.001.
preprocess_state (bool) – Enable preprocessing the states in the collected experiences before feeding as training batch. Defaults to True.
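The roles of gamma and lmb in generalized advantage estimation (GAE) can be illustrated with the standard backward recursion. This is a sketch for a single trajectory without intermediate episode ends; the function name is hypothetical and this is not the library's code.

```python
from typing import List, Sequence


def compute_gae(rewards: Sequence[float], values: Sequence[float],
                last_value: float, gamma: float = 0.995,
                lmb: float = 0.97) -> List[float]:
    """Advantage estimates A_t = sum_k (gamma * lmb)^k * delta_{t+k}."""
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error of the value function.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lmb * running
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0]
# lmb=0 keeps only the one-step TD error (low variance, more bias).
print(compute_gae(rewards, values, 0.0, gamma=1.0, lmb=0.0))
# lmb=1 sums all future TD errors (less bias, higher variance).
print(compute_gae(rewards, values, 0.0, gamma=1.0, lmb=1.0))
```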

class nnabla_rl.algorithms.gail.GAIL(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, expert_buffer: ~nnabla_rl.replay_buffer.ReplayBuffer, config: ~nnabla_rl.algorithms.gail.GAILConfig = GAILConfig(gpu_id=-1, preprocess_state=True, act_deterministic_in_eval=True, discriminator_batch_size=50000, discriminator_learning_rate=0.01, discriminator_update_frequency=1, adversary_entropy_coef=0.001, policy_update_frequency=1, gamma=0.995, lmb=0.97, pi_batch_size=50000, num_steps_per_iteration=50000, sigma_kl_divergence_constraint=0.01, maximum_backtrack_numbers=10, conjugate_gradient_damping=0.1, conjugate_gradient_iterations=10, vf_epochs=5, vf_batch_size=128, vf_learning_rate=0.001), v_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.v_function.VFunction] = <nnabla_rl.algorithms.gail.DefaultVFunctionBuilder object>, v_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.gail.DefaultVFunctionSolverBuilder object>, policy_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.gail.DefaultPolicyBuilder object>, reward_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.reward_function.RewardFunction] = <nnabla_rl.algorithms.gail.DefaultRewardFunctionBuilder object>, reward_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.gail.DefaultRewardFunctionSolverBuilder object>, state_preprocessor_builder: ~nnabla_rl.builders.preprocessor_builder.PreprocessorBuilder | None = <nnabla_rl.algorithms.gail.DefaultPreprocessorBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.gail.DefaultExplorerBuilder object>)[source]¶

Bases: Algorithm

Generative Adversarial Imitation Learning implementation.

This class implements the Generative Adversarial Imitation Learning (GAIL) algorithm proposed by Jonathan Ho, et al. in the paper “Generative Adversarial Imitation Learning”. For details see: https://arxiv.org/abs/1606.03476

This algorithm only supports online training.

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
expert_buffer (ReplayBuffer) – replay buffer which contains expert experience.
config (GAILConfig) – configuration of GAIL algorithm
v_function_builder (ModelBuilder[VFunction]) – builder of v function models
v_solver_builder (SolverBuilder) – builder for v function solvers
policy_builder (ModelBuilder[StochasticPolicy]) – builder of policy models
reward_function_builder (ModelBuilder[RewardFunction]) – builder of reward function models
reward_solver_builder (SolverBuilder) – builder for reward function solvers
state_preprocessor_builder (None or PreprocessorBuilder) – state preprocessor builder to preprocess the states
explorer_builder (ExplorerBuilder) – builder of environment explorer

compute_eval_action(**kwargs)¶

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters:

state (np.ndarray) – state to compute the action.
begin_of_episode (bool) – Used for resetting the rnn state. This flag indicates the beginning of an episode.
extra_info (Dict[str, Any]) – Dictionary to provide extra information to compute the action. Most algorithms do not use this field.

Returns:

Action for given state using current trained policy.

Return type:

np.ndarray

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

property latest_iteration_state¶

Return the latest iteration state, which is composed of items describing the state of the training process. You can use this state for debugging (e.g. plotting a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns:: Dictionary with items of training process state.
Return type:: Dict[str, Any]

HER¶

class nnabla_rl.algorithms.her.HERConfig(gpu_id: int = -1, gamma: float = 0.99, learning_rate: float = 0.001, batch_size: int = 100, tau: float = 0.005, start_timesteps: int = 10000, replay_buffer_size: int = 1000000, exploration_noise_sigma: float = 0.1, num_steps: int = 1, actor_unroll_steps: int = 1, actor_burn_in_steps: int = 0, actor_reset_rnn_on_terminal: bool = True, critic_unroll_steps: int = 1, critic_burn_in_steps: int = 0, critic_reset_rnn_on_terminal: bool = True, n_cycles: int = 50, n_rollout: int = 16, n_update: int = 40, max_timesteps: int = 50, hindsight_prob: float = 0.8, action_loss_coef: float = 1.0, return_clip: Tuple[float, float] | None = (-50.0, 0.0), exploration_epsilon: float = 0.3, preprocess_state: bool = True, normalize_epsilon: float = 0.01, normalize_clip_range: Tuple[float, float] | None = (-5.0, 5.0), observation_clip_range: Tuple[float, float] | None = (-200.0, 200.0))[source]¶

Bases: DDPGConfig

List of configurations for HER algorithm.

Parameters:

gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.001.
batch_size (int) – training batch size. Defaults to 100.
tau (float) – target network’s parameter update coefficient. Defaults to 0.005.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 10000.
replay_buffer_size (int) – capacity of the replay buffer. Defaults to 1000000.
exploration_noise_sigma (float) – standard deviation of gaussian exploration noise. Defaults to 0.1.
n_cycles (int) – the number of cycles. A cycle consists of collecting experiences for several episodes and then updating the model several times. Defaults to 50.
n_rollout (int) – the number of episodes in which the policy collects experiences. Defaults to 16.
n_update (int) – the number of model updates. Defaults to 40.
max_timesteps (int) – the timestep at which an episode ends. Defaults to 50.
hindsight_prob (float) – the probability with which the buffer samples a hindsight goal. Defaults to 0.8.
action_loss_coef (float) – the coefficient of the action loss in the policy trainer. Defaults to 1.0.
return_clip (Optional[Tuple[float, float]]) – the range to which the return value is clipped. Defaults to (-50.0, 0.0).
exploration_epsilon (float) – the epsilon value for the ε-greedy explorer. Defaults to 0.3.
preprocess_state (bool) – Enable preprocessing the states in the collected experiences before feeding as training batch. Defaults to True.
normalize_epsilon (float) – the minimum value of the standard deviation of the preprocessed state. Defaults to 0.01.
normalize_clip_range (Optional[Tuple[float, float]]) – the range to which the state is clipped. Defaults to (-5.0, 5.0).
observation_clip_range (Optional[Tuple[float, float]]) – the range to which the observation is clipped. Defaults to (-200.0, 200.0).
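The effect of hindsight_prob can be sketched as goal relabeling at sampling time. The example assumes the "future" strategy from the HER paper (the substitute goal is one achieved later in the same episode) and a sparse reward of 0 on success and -1 otherwise. All names, the tolerance value, and the reward convention are illustrative assumptions rather than the library's implementation.

```python
import random
from typing import List, Tuple

Goal = Tuple[float, ...]


def sparse_reward(achieved: Goal, goal: Goal, tolerance: float = 0.05) -> float:
    """0.0 if the achieved goal is within tolerance of the goal, else -1.0."""
    distance = sum((a - g) ** 2 for a, g in zip(achieved, goal)) ** 0.5
    return 0.0 if distance <= tolerance else -1.0


def relabel_goal(achieved_goals: List[Goal], index: int, original_goal: Goal,
                 hindsight_prob: float, rng: random.Random) -> Tuple[Goal, float]:
    """Return the (possibly relabeled) goal and reward for one transition.

    achieved_goals[t] is the goal achieved after the action taken at step t.
    """
    goal = original_goal
    if rng.random() < hindsight_prob:
        # Assumed "future" strategy: use a goal achieved at this step or later.
        future = rng.randrange(index, len(achieved_goals))
        goal = achieved_goals[future]
    # The reward is recomputed against whichever goal is used.
    return goal, sparse_reward(achieved_goals[index], goal)


achieved = [(0.0,), (1.0,), (2.0,)]
rng = random.Random(0)
print(relabel_goal(achieved, 2, (5.0,), hindsight_prob=1.0, rng=rng))
print(relabel_goal(achieved, 2, (5.0,), hindsight_prob=0.0, rng=rng))
```

Under this reward convention and with episodes of at most 50 steps, returns lie between -50 and 0, which is consistent with the default return_clip range.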

class nnabla_rl.algorithms.her.HER(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.her.HERConfig = HERConfig(gpu_id=-1, gamma=0.99, learning_rate=0.001, batch_size=100, tau=0.005, start_timesteps=10000, replay_buffer_size=1000000, exploration_noise_sigma=0.1, num_steps=1, actor_unroll_steps=1, actor_burn_in_steps=0, actor_reset_rnn_on_terminal=True, critic_unroll_steps=1, critic_burn_in_steps=0, critic_reset_rnn_on_terminal=True, n_cycles=50, n_rollout=16, n_update=40, max_timesteps=50, hindsight_prob=0.8, action_loss_coef=1.0, return_clip=(-50.0, 0.0), exploration_epsilon=0.3, preprocess_state=True, normalize_epsilon=0.01, normalize_clip_range=(-5.0, 5.0), observation_clip_range=(-200.0, 200.0)), critic_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.her.HERCriticBuilder object>, critic_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.her.HERSolverBuilder object>, actor_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.policy.DeterministicPolicy] = <nnabla_rl.algorithms.her.HERActorBuilder object>, actor_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.her.HERSolverBuilder object>, state_preprocessor_builder: ~nnabla_rl.builders.preprocessor_builder.PreprocessorBuilder | None = <nnabla_rl.algorithms.her.HERPreprocessorBuilder object>, replay_buffer_builder: ~nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.her.HindsightReplayBufferBuilder object>)[source]¶

Bases: DDPG

Hindsight Experience Replay (HER) algorithm implementation.

This class implements the Hindsight Experience Replay (HER) algorithm proposed by M. Andrychowicz, et al. in the paper “Hindsight Experience Replay”. For details see: https://arxiv.org/abs/1707.01495

This algorithm only supports online training.

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (HERConfig) – configuration of HER algorithm
critic_builder (ModelBuilder[QFunction]) – builder of critic models
critic_solver_builder (SolverBuilder) – builder for critic solvers
actor_builder (ModelBuilder[DeterministicPolicy]) – builder of actor models
actor_solver_builder (SolverBuilder) – builder for actor solvers
state_preprocessor_builder (None or PreprocessorBuilder) – state preprocessor builder to preprocess the states
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

HyAR¶

class nnabla_rl.algorithms.hyar.HyARConfig(gpu_id: int = -1, gamma: float = 0.99, learning_rate: float = 0.001, batch_size: int = 100, tau: float = 0.005, start_timesteps: int = 10000, replay_buffer_size: int = 1000000, d: int = 2, exploration_noise_sigma: float = 0.1, train_action_noise_sigma: float = 0.1, train_action_noise_abs: float = 0.5, num_steps: int = 1, actor_unroll_steps: int = 1, actor_burn_in_steps: int = 0, actor_reset_rnn_on_terminal: bool = True, critic_unroll_steps: int = 1, critic_burn_in_steps: int = 0, critic_reset_rnn_on_terminal: bool = True, noisy_action_min: float = -1.0, noisy_action_max: float = -1.0, latent_dim: int = 6, embed_dim: int = 6, T: int = 10, vae_pretrain_episodes: int = 20000, vae_pretrain_batch_size: int = 64, vae_pretrain_times: int = 5000, vae_training_batch_size: int = 64, vae_training_times: int = 1, vae_learning_rate: float = 0.0001, vae_buffer_size: int = 2000000, latent_select_batch_size: int = 5000, latent_select_range: float = 96.0, noise_decay_steps: int = 1000, initial_exploration_noise: float = 1.0, final_exploration_noise: float = 0.1)[source]¶

Bases: TD3Config

List of configurations for HyAR algorithm.

Parameters:

gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.001.
batch_size (int) – training batch size. Defaults to 100.
tau (float) – target network’s parameter update coefficient. Defaults to 0.005.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 10000.
replay_buffer_size (int) – capacity of the replay buffer. Defaults to 1000000.
d (int) – Interval of the policy update. The policy will be updated every d q-function updates. Defaults to 2.
exploration_noise_sigma (float) – Standard deviation of the gaussian exploration noise. Defaults to 0.1.
train_action_noise_sigma (float) – Standard deviation of the gaussian action noise used in the training. Defaults to 0.1.
train_action_noise_abs (float) – Absolute limit value of action noise used in the training. Defaults to 0.5.
noisy_action_max (float) – Maximum value of the training action after appending the noise. Defaults to 1.0.
noisy_action_min (float) – Minimum value of the training action after appending the noise. Defaults to -1.0.
num_steps (int) – number of steps for N-step Q targets. Defaults to 1.
actor_unroll_steps (int) – Number of steps to unroll actor’s training network. The network will be unrolled even if the provided model doesn’t have RNN layers. Defaults to 1.
actor_burn_in_steps (int) – Number of burn-in steps to initialize actor’s recurrent layer states during training. This flag does not take effect if the given model is not an RNN model. Defaults to 0.
actor_reset_rnn_on_terminal (bool) – Reset actor’s recurrent internal states to zero during training if the episode ends. This flag does not take effect if the given model is not an RNN model. Defaults to True.
critic_unroll_steps (int) – Number of steps to unroll critic’s training network. The network will be unrolled even if the provided model doesn’t have RNN layers. Defaults to 1.
critic_burn_in_steps (int) – Number of burn-in steps to initialize critic’s recurrent layer states during training. This flag does not take effect if the given model is not an RNN model. Defaults to 0.
critic_reset_rnn_on_terminal (bool) – Reset critic’s recurrent internal states to zero during training if the episode ends. This flag does not take effect if the given model is not an RNN model. Defaults to True.
latent_dim (int) – Latent action’s dimension. Defaults to 6.
embed_dim (int) – Discrete action embedding’s dimension. Defaults to 6.
T (int) – VAE training interval. VAE is trained every T episodes. Defaults to 10.
vae_pretrain_episodes (int) – Number of data collection episodes for VAE pretraining. Defaults to 20000.
vae_pretrain_batch_size (int) – Batch size used in VAE pretraining. Defaults to 64.
vae_pretrain_times (int) – VAE is updated for this number of iterations during the pretrain stage. Defaults to 5000.
vae_training_batch_size (int) – Batch size used in VAE training. Defaults to 64.
vae_training_times (int) – VAE is updated for this number of iterations every T steps. Defaults to 1.
vae_learning_rate (float) – VAE learning rate. Defaults to 1e-4.
vae_buffer_size (int) – Replay buffer size for VAE model. Defaults to 2000000.
latent_select_batch_size (int) – Batch size for computing latent space constraint (LSC). Defaults to 5000.
latent_select_range (float) – Percentage of the latent variables in central range. Defaults to 96.
noise_decay_steps (int) – Exploration noise decay steps. Noise decays for this number of experienced episodes. Defaults to 1000.
initial_exploration_noise (float) – Initial standard deviation of exploration noise. Defaults to 1.0.
final_exploration_noise (float) – Final standard deviation of exploration noise. Defaults to 0.1.
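The four training-noise settings (train_action_noise_sigma, train_action_noise_abs, noisy_action_min, noisy_action_max) can be read as the target policy smoothing step of TD3, from which this config derives. The sketch below assumes that reading for a scalar action; the helper names are hypothetical and the bounds are passed explicitly rather than taken from the defaults.

```python
import random


def clip(value: float, low: float, high: float) -> float:
    """Clamp value to the closed interval [low, high]."""
    return max(low, min(high, value))


def noisy_training_action(action: float, sigma: float, noise_abs: float,
                          action_min: float, action_max: float,
                          rng: random.Random) -> float:
    """Add clipped gaussian noise to an action, then clip the result."""
    # The noise magnitude is limited to noise_abs ...
    noise = clip(rng.gauss(0.0, sigma), -noise_abs, noise_abs)
    # ... and the noisy action is kept inside the allowed action range.
    return clip(action + noise, action_min, action_max)


rng = random.Random(1)
samples = [noisy_training_action(0.9, 0.1, 0.5, -1.0, 1.0, rng) for _ in range(5)]
print([round(s, 3) for s in samples])
```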

class nnabla_rl.algorithms.hyar.HyAR(env_or_env_info, config: ~nnabla_rl.algorithms.hyar.HyARConfig = HyARConfig(gpu_id=-1, gamma=0.99, learning_rate=0.001, batch_size=100, tau=0.005, start_timesteps=10000, replay_buffer_size=1000000, d=2, exploration_noise_sigma=0.1, train_action_noise_sigma=0.1, train_action_noise_abs=0.5, num_steps=1, actor_unroll_steps=1, actor_burn_in_steps=0, actor_reset_rnn_on_terminal=True, critic_unroll_steps=1, critic_burn_in_steps=0, critic_reset_rnn_on_terminal=True, noisy_action_min=-1.0, noisy_action_max=-1.0, latent_dim=6, embed_dim=6, T=10, vae_pretrain_episodes=20000, vae_pretrain_batch_size=64, vae_pretrain_times=5000, vae_training_batch_size=64, vae_training_times=1, vae_learning_rate=0.0001, vae_buffer_size=2000000, latent_select_batch_size=5000, latent_select_range=96.0, noise_decay_steps=1000, initial_exploration_noise=1.0, final_exploration_noise=0.1), critic_builder=<nnabla_rl.algorithms.hyar.DefaultCriticBuilder object>, critic_solver_builder=<nnabla_rl.algorithms.td3.DefaultSolverBuilder object>, actor_builder=<nnabla_rl.algorithms.hyar.DefaultActorBuilder object>, actor_solver_builder=<nnabla_rl.algorithms.hyar.DefaultActorSolverBuilder object>, vae_builder=<nnabla_rl.algorithms.hyar.DefaultVAEBuilder object>, vae_solver_buidler=<nnabla_rl.algorithms.hyar.DefaultVAESolverBuilder object>, replay_buffer_builder=<nnabla_rl.algorithms.hyar.DefaultReplayBufferBuilder object>, vae_buffer_builder=<nnabla_rl.algorithms.hyar.DefaultVAEBufferBuilder object>, explorer_builder=<nnabla_rl.algorithms.hyar.DefaultExplorerBuilder object>, pretrain_explorer_builder=<nnabla_rl.algorithms.hyar.DefaultPretrainExplorerBuilder object>)[source]¶

Bases: TD3

HyAR algorithm.

This class implements the Hybrid Action Representation (HyAR) algorithm proposed by Boyan Li, et al. in the paper: “HyAR: Addressing Discrete-Continuous Action Reinforcement Learning via Hybrid Action Representation” For details see: https://openreview.net/pdf?id=64trBbOhdGU

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (HyARConfig) – the parameter for HyAR training
critic_builder (ModelBuilder) – builder of q function model
critic_solver_builder (SolverBuilder) – builder of q function solver
actor_builder (ModelBuilder) – builder of policy model
actor_solver_builder (SolverBuilder) – builder of policy solver
vae_builder (ModelBuilder) – builder of vae model
vae_solver_builder (SolverBuilder) – builder of vae solver. Note that the signature above spells this keyword vae_solver_buidler.
replay_buffer_builder (ReplayBufferBuilder) – builder of q-function and policy replay_buffer
vae_buffer_builder (ReplayBufferBuilder) – builder of vae’s replay_buffer
explorer_builder (ExplorerBuilder) – builder of environment explorer for main training stage
pretrain_explorer_builder (ExplorerBuilder) – builder of environment explorer for pretraining stage

classmethod is_rnn_supported()[source]¶

Check whether the algorithm supports rnn models or not.

Returns:: True if the algorithm supports rnn models. Otherwise False.
Return type:: bool

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

property latest_iteration_state¶

Return the latest iteration state, which is composed of items of the training process state. You can use this state for debugging (e.g. plotting the loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns:: Dictionary with items of training process state.
Return type:: Dict[str, Any]

iLQR¶

class nnabla_rl.algorithms.ilqr.iLQRConfig(gpu_id: int = -1, T_max: int = 50, num_iterations: int = 10, mu_min: float = 1e-06, modification_factor: float = 2.0, accept_improvement_ratio: float = 0.0)[source]¶

Bases: DDPConfig

class nnabla_rl.algorithms.ilqr.iLQR(env_or_env_info, dynamics: Dynamics, cost_function: CostFunction, config=iLQRConfig(gpu_id=-1, T_max=50, num_iterations=10, mu_min=1e-06, modification_factor=2.0, accept_improvement_ratio=0.0))[source]¶

Bases: DDP

Iterative LQR (Linear Quadratic Regulator) algorithm. This class implements the iterative Linear Quadratic Regulator (iLQR) algorithm proposed by Y. Tassa, et al. in the paper: “Synthesis and Stabilization of Complex Behaviors through Online Trajectory Optimization” For details see: https://homes.cs.washington.edu/~todorov/papers/TassaIROS12.pdf.

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
dynamics (Dynamics) – dynamics of the system to control
cost_function (CostFunction) – cost function to optimize the trajectory
config (iLQRConfig) – the parameter for iLQR controller

IQN¶

class nnabla_rl.algorithms.iqn.IQNConfig(gpu_id: int = -1, gamma: float = 0.99, learning_rate: float = 5e-05, batch_size: int = 32, num_steps: int = 1, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, learner_update_frequency: int = 4, target_update_frequency: int = 10000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.01, test_epsilon: float = 0.001, N: int = 64, N_prime: int = 64, K: int = 32, kappa: float = 1.0, embedding_dim: int = 64, unroll_steps: int = 1, burn_in_steps: int = 0, reset_rnn_on_terminal: bool = True)[source]¶

Bases: AlgorithmConfig

List of configurations for IQN algorithm.

Parameters:

gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.00005.
batch_size (int) – training batch size. Defaults to 32.
num_steps (int) – number of steps for N-step Q targets. Defaults to 1.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 50000.
replay_buffer_size (int) – the capacity of replay buffer. Defaults to 1000000.
learner_update_frequency (int) – the interval of learner update. Defaults to 4.
target_update_frequency (int) – the interval of target q-function update. Defaults to 10000.
max_explore_steps (int) – the number of steps decaying the epsilon value. The epsilon will be decayed linearly \(\epsilon=\epsilon_{init} - step\times\frac{\epsilon_{init} - \epsilon_{final}}{max\_explore\_steps}\). Defaults to 1000000.
initial_epsilon (float) – the initial epsilon value for ε-greedy explorer. Defaults to 1.0.
final_epsilon (float) – the last epsilon value for ε-greedy explorer. Defaults to 0.01.
test_epsilon (float) – the epsilon value on testing. Defaults to 0.001.
N (int) – Number of samples to compute the current state’s quantile values. Defaults to 64.
N_prime (int) – Number of samples to compute the target state’s quantile values. Defaults to 64.
K (int) – Number of samples to compute greedy next action. Defaults to 32.
kappa (float) – threshold value of quantile huber loss. Defaults to 1.0.
embedding_dim (int) – dimension of embedding for the sample point. Defaults to 64.
unroll_steps (int) – Number of steps to unroll the training network. The network will be unrolled even though the provided model doesn’t have RNN layers. Defaults to 1.
burn_in_steps (int) – Number of burn-in steps to initialize recurrent layer states during training. This flag does not take effect if the given model is not an RNN model. Defaults to 0.
reset_rnn_on_terminal (bool) – Reset recurrent internal states to zero during training if episode ends. This flag does not take effect if given model is not an RNN model. Defaults to True.
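
The linear schedule given for max_explore_steps can be sketched in plain Python as follows. This is only an illustration of the formula above, not the library's explorer implementation: the function name linear_epsilon is hypothetical, and holding epsilon at final_epsilon once max_explore_steps is reached is an assumption.

```python
def linear_epsilon(step, initial_epsilon=1.0, final_epsilon=0.01, max_explore_steps=1_000_000):
    """Linearly decay epsilon from initial_epsilon to final_epsilon (hypothetical helper)."""
    # Per-step decrement taken from the formula in the max_explore_steps description.
    slope = (initial_epsilon - final_epsilon) / max_explore_steps
    epsilon = initial_epsilon - step * slope
    # Assumption: epsilon stays at final_epsilon after max_explore_steps.
    return max(final_epsilon, epsilon)


for step in (0, 500_000, 1_000_000, 2_000_000):
    print(step, linear_epsilon(step))
```

With the default values, epsilon is halfway between 1.0 and 0.01 after 500000 steps.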

class nnabla_rl.algorithms.iqn.IQN(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.iqn.IQNConfig = IQNConfig(gpu_id=-1, gamma=0.99, learning_rate=5e-05, batch_size=32, num_steps=1, start_timesteps=50000, replay_buffer_size=1000000, learner_update_frequency=4, target_update_frequency=10000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.01, test_epsilon=0.001, N=64, N_prime=64, K=32, kappa=1.0, embedding_dim=64, unroll_steps=1, burn_in_steps=0, reset_rnn_on_terminal=True), quantile_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.distributional_function.StateActionQuantileFunction] = <nnabla_rl.algorithms.iqn.DefaultQuantileFunctionBuilder object>, quantile_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.iqn.DefaultSolverBuilder object>, replay_buffer_builder: ~nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.iqn.DefaultReplayBufferBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.iqn.DefaultExplorerBuilder object>, risk_measure_function=<function risk_neutral_measure>)[source]¶

Bases: Algorithm

Implicit Quantile Network algorithm.

This class implements the Implicit Quantile Network (IQN) algorithm proposed by W. Dabney, et al. in the paper: “Implicit Quantile Networks for Distributional Reinforcement Learning” For details see: https://arxiv.org/pdf/1806.06923.pdf

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (IQNConfig) – configuration of IQN algorithm
quantile_function_builder (ModelBuilder[StateActionQuantileFunction]) – builder of state-action quantile function models
quantile_solver_builder (SolverBuilder) – builder for state action quantile function solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
explorer_builder (ExplorerBuilder) – builder of environment explorer
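
To illustrate the role of kappa in IQNConfig, the quantile Huber loss from the IQN paper can be written for a single TD error as below. This is a sketch under the paper's definition; the helper names huber and quantile_huber_loss are hypothetical, and the library's own loss code may differ in details such as batching and reduction.

```python
def huber(delta, kappa=1.0):
    """Huber loss: quadratic within [-kappa, kappa], linear outside."""
    if abs(delta) <= kappa:
        return 0.5 * delta * delta
    return kappa * (abs(delta) - 0.5 * kappa)


def quantile_huber_loss(delta, tau, kappa=1.0):
    """Quantile Huber loss for one TD error delta at quantile fraction tau."""
    # Negative errors are weighted by (1 - tau), non-negative errors by tau.
    indicator = 1.0 if delta < 0 else 0.0
    return abs(tau - indicator) * huber(delta, kappa) / kappa


print(quantile_huber_loss(0.5, tau=0.25))
print(quantile_huber_loss(-2.0, tau=0.25))
```

The asymmetric weighting is what makes each sampled fraction tau estimate a different quantile of the return distribution.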

compute_eval_action(**kwargs)¶

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters:

state (np.ndarray) – state to compute the action.
begin_of_episode (bool) – Used for rnn state resetting. This flag indicates the beginning of an episode.
extra_info (Dict[str, Any]) – Dictionary to provide extra information to compute the action. Most algorithms do not use this field.

Returns:

Action for given state using current trained policy.

Return type:

np.ndarray

classmethod is_rnn_supported()[source]¶

Check whether the algorithm supports rnn models or not.

Returns:: True if the algorithm supports rnn models. Otherwise False.
Return type:: bool

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

property latest_iteration_state¶

Return the latest iteration state, which is composed of items of the training process state. You can use this state for debugging (e.g. plotting the loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns:: Dictionary with items of training process state.
Return type:: Dict[str, Any]

LQR¶

class nnabla_rl.algorithms.lqr.LQRConfig(gpu_id: int = -1, T_max: int = 50)[source]¶

Bases: AlgorithmConfig

List of configurations for LQR (Linear Quadratic Regulator) algorithm.

Parameters:: T_max (int) – Planning time step length. Defaults to 50.

class nnabla_rl.algorithms.lqr.LQR(env_or_env_info, dynamics: Dynamics, cost_function: CostFunction, config=LQRConfig(gpu_id=-1, T_max=50))[source]¶

Bases: Algorithm

LQR (Linear Quadratic Regulator) algorithm.

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
dynamics (Dynamics) – dynamics of the system to control
cost_function (CostFunction) – cost function to optimize the trajectory
config (LQRConfig) – the parameter for LQR controller
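
As a rough illustration of what a finite-horizon LQR controller computes over T_max planning steps, the sketch below runs the backward Riccati recursion for a scalar system. It does not use the library's Dynamics or CostFunction classes; the function name scalar_lqr_gains, the scalar system, and the choice of terminal cost are all assumptions made for this example.

```python
def scalar_lqr_gains(a, b, q, r, T_max=50):
    """Backward Riccati recursion for x[t+1] = a*x[t] + b*u[t] with stage cost q*x^2 + r*u^2."""
    p = q  # assumption: terminal cost equals the stage state cost
    gains = []
    for _ in range(T_max):
        k = (b * p * a) / (r + b * p * b)  # feedback gain for u = -k * x
        p = q + a * p * a - a * p * b * k  # cost-to-go coefficient one step earlier
        gains.append(k)
    gains.reverse()  # gains[t] is the gain to apply at time step t
    return gains


gains = scalar_lqr_gains(a=1.0, b=1.0, q=1.0, r=1.0, T_max=50)
x = 1.0
for k in gains:
    x = 1.0 * x + 1.0 * (-k * x)  # closed-loop rollout with u = -k * x
print(gains[0], x)
```

The gains are time-varying near the end of the horizon and settle to a constant value far from it, which is why the planning length matters.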

compute_eval_action(**kwargs)¶

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters:

state (np.ndarray) – state to compute the action.
begin_of_episode (bool) – Used for rnn state resetting. This flag indicates the beginning of an episode.
extra_info (Dict[str, Any]) – Dictionary to provide extra information to compute the action. Most algorithms do not use this field.

Returns:

Action for given state using current trained policy.

Return type:

np.ndarray

compute_trajectory(**kwargs)¶

Compute trajectory (sequence of state and action tuples) from given initial trajectory using current policy. Most reinforcement learning algorithms do not implement this method. Only the optimal control algorithms implement it.

Parameters:: initial_trajectory (Sequence[Tuple[np.ndarray, Optional[np.ndarray]]]) – initial trajectory.
Returns:: Sequence of state and action tuples and extra information (if any) at each timestep, computed with current best policy. Extra information depends on the algorithm. The sequence length is the same as the length of the initial trajectory.
Return type:: Tuple[Sequence[Tuple[np.ndarray, Optional[np.ndarray]]], Sequence[Dict[str, Any]]]

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

MMESAC¶

class nnabla_rl.algorithms.mme_sac.MMESACConfig(gpu_id: int = -1, gamma: float = 0.99, learning_rate: float = 0.00030000000000000003, batch_size: int = 256, tau: float = 0.005, environment_steps: int = 1, gradient_steps: int = 1, reward_scalar: float = 5.0, start_timesteps: int = 10000, replay_buffer_size: int = 1000000, target_update_interval: int = 1, num_steps: int = 1, pi_unroll_steps: int = 1, pi_burn_in_steps: int = 0, pi_reset_rnn_on_terminal: bool = True, q_unroll_steps: int = 1, q_burn_in_steps: int = 0, q_reset_rnn_on_terminal: bool = True, v_unroll_steps: int = 1, v_burn_in_steps: int = 0, v_reset_rnn_on_terminal: bool = True, alpha_pi: float | None = None, alpha_q: float = 1.0)[source]¶

Bases: ICML2018SACConfig

List of configurations for MMESAC algorithm.

Parameters:

alpha_pi (Optional[float]) – If None, will use reward_scalar to scale the reward. Otherwise 1/alpha_pi will be used to scale the reward. Defaults to None.
alpha_q (float) – Temperature value for negative entropy term. Defaults to 1.0.
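
The rule described for alpha_pi can be restated as a small helper. This is only a reading of the parameter description above, not the library's code; the function name reward_scale is hypothetical.

```python
def reward_scale(reward_scalar=5.0, alpha_pi=None):
    """Return the factor applied to rewards (hypothetical helper)."""
    if alpha_pi is None:
        # No alpha_pi given: fall back to reward_scalar.
        return reward_scalar
    # alpha_pi given: scale rewards by its reciprocal.
    return 1.0 / alpha_pi


print(reward_scale())              # uses reward_scalar
print(reward_scale(alpha_pi=0.5))  # uses 1 / alpha_pi
```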

class nnabla_rl.algorithms.mme_sac.MMESAC(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.mme_sac.MMESACConfig = MMESACConfig(gpu_id=-1, gamma=0.99, learning_rate=0.00030000000000000003, batch_size=256, tau=0.005, environment_steps=1, gradient_steps=1, reward_scalar=5.0, start_timesteps=10000, replay_buffer_size=1000000, target_update_interval=1, num_steps=1, pi_unroll_steps=1, pi_burn_in_steps=0, pi_reset_rnn_on_terminal=True, q_unroll_steps=1, q_burn_in_steps=0, q_reset_rnn_on_terminal=True, v_unroll_steps=1, v_burn_in_steps=0, v_reset_rnn_on_terminal=True, alpha_pi=None, alpha_q=1.0), v_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.v_function.VFunction] = <nnabla_rl.algorithms.icml2018_sac.DefaultVFunctionBuilder object>, v_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.icml2018_sac.DefaultSolverBuilder object>, q_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.icml2018_sac.DefaultQFunctionBuilder object>, q_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.icml2018_sac.DefaultSolverBuilder object>, policy_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.icml2018_sac.DefaultPolicyBuilder object>, policy_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.icml2018_sac.DefaultSolverBuilder object>, replay_buffer_builder: ~nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.icml2018_sac.DefaultReplayBufferBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.icml2018_sac.DefaultExplorerBuilder object>)[source]¶

Bases: ICML2018SAC

Max-Min Entropy Soft Actor-Critic (MME-SAC) algorithm.

This class implements the Max-Min Entropy Soft Actor Critic (MME-SAC) algorithm proposed by S. Han, et al. in the paper: “A Max-Min Entropy Framework for Reinforcement Learning” For details see: https://arxiv.org/abs/2106.10517

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (MMESACConfig) – configuration of the MMESAC algorithm
v_function_builder (ModelBuilder[VFunction]) – builder of v function models
v_solver_builder (SolverBuilder) – builder of v function solvers
q_function_builder (ModelBuilder[QFunction]) – builder of q function models
q_solver_builder (SolverBuilder) – builder of q function solvers
policy_builder (ModelBuilder[StochasticPolicy]) – builder of actor models
policy_solver_builder (SolverBuilder) – builder of policy solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
explorer_builder (ExplorerBuilder) – builder of environment explorer

MMESAC (Disentangled)¶

class nnabla_rl.algorithms.demme_sac.DEMMESACConfig(gpu_id: int = -1, gamma: float = 0.99, learning_rate: float = 0.00030000000000000003, batch_size: int = 256, tau: float = 0.005, environment_steps: int = 1, gradient_steps: int = 1, start_timesteps: int = 10000, replay_buffer_size: int = 1000000, target_update_interval: int = 1, num_rr_steps: int = 1, num_re_steps: int = 1, reward_scalar: float = 5.0, alpha_pi: float | None = None, alpha_q: float = 1.0, pi_t_unroll_steps: int = 1, pi_t_burn_in_steps: int = 0, pi_t_reset_rnn_on_terminal: bool = True, pi_e_unroll_steps: int = 1, pi_e_burn_in_steps: int = 0, pi_e_reset_rnn_on_terminal: bool = True, q_rr_unroll_steps: int = 1, q_rr_burn_in_steps: int = 0, q_rr_reset_rnn_on_terminal: bool = True, q_re_unroll_steps: int = 1, q_re_burn_in_steps: int = 0, q_re_reset_rnn_on_terminal: bool = True, v_rr_unroll_steps: int = 1, v_rr_burn_in_steps: int = 0, v_rr_reset_rnn_on_terminal: bool = True, v_re_unroll_steps: int = 1, v_re_burn_in_steps: int = 0, v_re_reset_rnn_on_terminal: bool = True)[source]¶

Bases: AlgorithmConfig

List of configurations for DEMMESAC algorithm.

Parameters:

gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.0003.
batch_size (int) – training batch size. Defaults to 256.
tau (float) – target network’s parameter update coefficient. Defaults to 0.005.
environment_steps (int) – Number of steps to interact with the environment on each iteration. Defaults to 1.
gradient_steps (int) – Number of parameter updates to perform on each iteration. Defaults to 1.
reward_scalar (float) – Reward scaling factor. Obtained reward will be multiplied by this value. Defaults to 5.0.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 10000.
replay_buffer_size (int) – capacity of the replay buffer. Defaults to 1000000.
num_rr_steps (int) – number of steps for N-step Q_rr targets. Defaults to 1.
num_re_steps (int) – number of steps for N-step Q_re targets. Defaults to 1.
target_update_interval (int) – the interval of target v function parameter’s update. Defaults to 1.
pi_t_unroll_steps (int) – Number of steps to unroll policy’s (pi_t) training network. The network will be unrolled even though the provided model doesn’t have RNN layers. Defaults to 1.
pi_e_unroll_steps (int) – Number of steps to unroll policy’s (pi_e) training network. The network will be unrolled even though the provided model doesn’t have RNN layers. Defaults to 1.
pi_t_burn_in_steps (int) – Number of burn-in steps to initialize policy’s (pi_t) recurrent layer states during training. This flag does not take effect if given model is not an RNN model. Defaults to 0.
pi_e_burn_in_steps (int) – Number of burn-in steps to initialize policy’s (pi_e) recurrent layer states during training. This flag does not take effect if given model is not an RNN model. Defaults to 0.
pi_t_reset_rnn_on_terminal (bool) – Reset policy’s (pi_t) recurrent internal states to zero during training if episode ends. This flag does not take effect if given model is not an RNN model. Defaults to True.
pi_e_reset_rnn_on_terminal (bool) – Reset policy’s (pi_e) recurrent internal states to zero during training if episode ends. This flag does not take effect if given model is not an RNN model. Defaults to True.
q_rr_unroll_steps (int) – Number of steps to unroll q-function’s (q_rr) training network. The network will be unrolled even though the provided model doesn’t have RNN layers. Defaults to 1.
q_re_unroll_steps (int) – Number of steps to unroll q-function’s (q_re) training network. The network will be unrolled even though the provided model doesn’t have RNN layers. Defaults to 1.
q_rr_burn_in_steps (int) – Number of burn-in steps to initialize q-function’s (q_rr) recurrent layer states during training. This flag does not take effect if given model is not an RNN model. Defaults to 0.
q_re_burn_in_steps (int) – Number of burn-in steps to initialize q-function’s (q_re) recurrent layer states during training. This flag does not take effect if given model is not an RNN model. Defaults to 0.
q_rr_reset_rnn_on_terminal (bool) – Reset q-function’s (q_rr) recurrent internal states to zero during training if episode ends. This flag does not take effect if given model is not an RNN model. Defaults to True.
q_re_reset_rnn_on_terminal (bool) – Reset q-function’s (q_re) recurrent internal states to zero during training if episode ends. This flag does not take effect if given model is not an RNN model. Defaults to True.
v_rr_unroll_steps (int) – Number of steps to unroll v-function’s (v_rr) training network. The network will be unrolled even though the provided model doesn’t have RNN layers. Defaults to 1.
v_re_unroll_steps (int) – Number of steps to unroll v-function’s (v_re) training network. The network will be unrolled even though the provided model doesn’t have RNN layers. Defaults to 1.
v_rr_burn_in_steps (int) – Number of burn-in steps to initialize v-function’s (v_rr) recurrent layer states during training. This flag does not take effect if given model is not an RNN model. Defaults to 0.
v_re_burn_in_steps (int) – Number of burn-in steps to initialize v-function’s (v_re) recurrent layer states during training. This flag does not take effect if given model is not an RNN model. Defaults to 0.
v_rr_reset_rnn_on_terminal (bool) – Reset v-function’s (v_rr) recurrent internal states to zero during training if episode ends. This flag does not take effect if given model is not an RNN model. Defaults to True.
v_re_reset_rnn_on_terminal (bool) – Reset v-function’s (v_re) recurrent internal states to zero during training if episode ends. This flag does not take effect if given model is not an RNN model. Defaults to True.
alpha_pi (Optional[float]) – If None, will use reward_scalar to scale the reward. Otherwise 1/alpha_pi will be used to scale the reward. Defaults to None.
alpha_q (float) – Temperature value for negative entropy term. Defaults to 1.0.
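
num_rr_steps and num_re_steps set the N used in N-step targets. As a generic sketch of what an N-step target is, the helper below sums N discounted rewards and adds a discounted bootstrap value. The function name n_step_target is hypothetical, the example assumes the episode does not terminate inside the window, and it does not show how DEMMESAC combines reward and entropy terms.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """Discounted sum of N rewards plus the discounted bootstrap value."""
    target = 0.0
    for k, reward in enumerate(rewards):
        target += (gamma ** k) * reward
    # Bootstrap from the value estimate of the state reached after N steps.
    target += (gamma ** len(rewards)) * bootstrap_value
    return target


# Three rewards of 1.0, bootstrap value 10.0, gamma 0.5:
# 1 + 0.5 + 0.25 + 0.125 * 10 = 3.0
print(n_step_target([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5))
```

With a single reward the target reduces to the usual one-step form, which corresponds to the default value of 1.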

class nnabla_rl.algorithms.demme_sac.DEMMESAC(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.demme_sac.DEMMESACConfig = DEMMESACConfig(gpu_id=-1, gamma=0.99, learning_rate=0.00030000000000000003, batch_size=256, tau=0.005, environment_steps=1, gradient_steps=1, start_timesteps=10000, replay_buffer_size=1000000, target_update_interval=1, num_rr_steps=1, num_re_steps=1, reward_scalar=5.0, alpha_pi=None, alpha_q=1.0, pi_t_unroll_steps=1, pi_t_burn_in_steps=0, pi_t_reset_rnn_on_terminal=True, pi_e_unroll_steps=1, pi_e_burn_in_steps=0, pi_e_reset_rnn_on_terminal=True, q_rr_unroll_steps=1, q_rr_burn_in_steps=0, q_rr_reset_rnn_on_terminal=True, q_re_unroll_steps=1, q_re_burn_in_steps=0, q_re_reset_rnn_on_terminal=True, v_rr_unroll_steps=1, v_rr_burn_in_steps=0, v_rr_reset_rnn_on_terminal=True, v_re_unroll_steps=1, v_re_burn_in_steps=0, v_re_reset_rnn_on_terminal=True), v_rr_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.v_function.VFunction] = <nnabla_rl.algorithms.demme_sac.DefaultVFunctionBuilder object>, v_rr_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.demme_sac.DefaultSolverBuilder object>, v_re_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.v_function.VFunction] = <nnabla_rl.algorithms.demme_sac.DefaultVFunctionBuilder object>, v_re_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.demme_sac.DefaultSolverBuilder object>, q_rr_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.demme_sac.DefaultQFunctionBuilder object>, q_rr_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.demme_sac.DefaultSolverBuilder object>, q_re_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.demme_sac.DefaultQFunctionBuilder object>, q_re_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.demme_sac.DefaultSolverBuilder object>, pi_t_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.demme_sac.DefaultPolicyBuilder object>, pi_t_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.demme_sac.DefaultSolverBuilder object>, pi_e_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.demme_sac.DefaultPolicyBuilder object>, pi_e_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.demme_sac.DefaultSolverBuilder object>, replay_buffer_builder: ~nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.demme_sac.DefaultReplayBufferBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.demme_sac.DefaultExplorerBuilder object>)[source]¶

Bases: Algorithm

DisEntangled Max-Min Entropy Soft Actor-Critic (DEMME-SAC) algorithm.

This class implements the disentangled version of max-min Soft Actor Critic (SAC) algorithm proposed by S. Han, et al. in the paper: “A Max-Min Entropy Framework for Reinforcement Learning” For details see: https://arxiv.org/abs/2106.10517

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (DEMMESACConfig) – configuration of the DEMMESAC algorithm
v_rr_function_builder (ModelBuilder[VFunction]) – builder of reward v function models
v_rr_solver_builder (SolverBuilder) – builder of reward v function solvers
v_re_function_builder (ModelBuilder[VFunction]) – builder of entropy v function models
v_re_solver_builder (SolverBuilder) – builder of entropy v function solvers
q_rr_function_builder (ModelBuilder[QFunction]) – builder of reward q function models
q_rr_solver_builder (SolverBuilder) – builder of reward q function solvers
q_re_function_builder (ModelBuilder[QFunction]) – builder of entropy q function models
q_re_solver_builder (SolverBuilder) – builder of entropy q function solvers
pi_t_builder (ModelBuilder[StochasticPolicy]) – builder of target policy models
pi_t_solver_builder (SolverBuilder) – builder of target policy solvers
pi_e_builder (ModelBuilder[StochasticPolicy]) – builder of pure exploration policy models
pi_e_solver_builder (SolverBuilder) – builder of pure exploration policy solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
explorer_builder (ExplorerBuilder) – builder of environment explorer

compute_eval_action(**kwargs)¶

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters:

state (np.ndarray) – state to compute the action.
begin_of_episode (bool) – Used for rnn state resetting. This flag indicates the beginning of an episode.
extra_info (Dict[str, Any]) – Dictionary to provide extra information to compute the action. Most algorithms do not use this field.

Returns:

Action for given state using current trained policy.

Return type:

np.ndarray

classmethod is_rnn_supported()[source]¶

Check whether the algorithm supports rnn models or not.

Returns:: True if the algorithm supports rnn models. Otherwise False.
Return type:: bool

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

property latest_iteration_state¶

Return the latest iteration state, which is composed of items of the training process state. You can use this state for debugging (e.g. plotting the loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns:: Dictionary with items of training process state.
Return type:: Dict[str, Any]

MPPI¶

class nnabla_rl.algorithms.mppi.MPPIConfig(gpu_id: int = -1, learning_rate: float = 0.001, batch_size: int = 100, replay_buffer_size: int = 1000000, training_iterations: int = 500, lmb: float = 1.0, M: int = 1, K: int = 500, T: int = 100, covariance: ndarray | None = None, use_known_dynamics: bool = False, unroll_steps: int = 1, burn_in_steps: int = 0, reset_rnn_on_terminal: bool = False, dt: float = 0.05)[source]¶

Bases: AlgorithmConfig

List of configurations for MPPI (Model Predictive Path Integral) algorithm.

Parameters:

learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.001.
batch_size (int) – training batch size. Defaults to 100.
replay_buffer_size (int) – capacity of the replay buffer. Defaults to 1000000.
training_iterations (int) – dynamics training iterations. Defaults to 500.
lmb (float) – scalar variable lambda used in the definition of free-energy. Defaults to 1.0.
M (int) – number of trials per training iteration. Defaults to 1.
K (int) – number of samples for importance sampling. Defaults to 500.
T (int) – number of prediction steps. Defaults to 100.
covariance (np.ndarray) – Covariance of Gaussian noise applied to control inputs. If covariance is not specified, covariance with unit variance will be used. Defaults to None.
use_known_dynamics (bool) – Use the dynamics model passed to the MPPI algorithm instead of the trained model to compute actions. Defaults to False.
unroll_steps (int) – Number of steps to unroll the dynamics model’s training network. The network will be unrolled even though the provided model doesn’t have RNN layers. Defaults to 1.
burn_in_steps (int) – Number of burn-in steps to initialize the dynamics model’s recurrent layer states during training. This flag does not take effect if given model is not an RNN model. Defaults to 0.
reset_rnn_on_terminal (bool) – Reset recurrent internal states to zero during training if episode ends. This flag does not take effect if given model is not an RNN model. Defaults to False.
dt (float) – Time interval between states. Defaults to 0.05 [s]. We strongly recommend adjusting this interval to take the sensor frequency into account.
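
To show how lmb and K interact, the sketch below computes normalized importance weights for K sampled trajectory costs in the way the MPPI paper describes: each weight is proportional to exp(-(cost - min_cost) / lmb). The function name mppi_weights is hypothetical, and this covers only the weighting step, not the control cost terms or the library's actual update.

```python
import math


def mppi_weights(costs, lmb=1.0):
    """Normalized importance weights for sampled trajectory costs."""
    beta = min(costs)  # subtracting the minimum keeps exp() numerically stable
    unnormalized = [math.exp(-(cost - beta) / lmb) for cost in costs]
    total = sum(unnormalized)
    return [w / total for w in unnormalized]


weights = mppi_weights([1.0, 2.0, 3.0], lmb=1.0)
print(weights)
```

A smaller lmb concentrates the weight on the lowest-cost samples, while a larger lmb averages more evenly over all K samples.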

class nnabla_rl.algorithms.mppi.MPPI(env_or_env_info, cost_function: ~nnabla_rl.numpy_models.cost_function.CostFunction, known_dynamics: ~nnabla_rl.numpy_models.dynamics.Dynamics | None = None, state_normalizer: ~typing.Callable[[~numpy.ndarray], ~numpy.ndarray] | None = None, config: ~nnabla_rl.algorithms.mppi.MPPIConfig = MPPIConfig(gpu_id=-1, learning_rate=0.001, batch_size=100, replay_buffer_size=1000000, training_iterations=500, lmb=1.0, M=1, K=500, T=100, covariance=None, use_known_dynamics=False, unroll_steps=1, burn_in_steps=0, reset_rnn_on_terminal=False, dt=0.05), dynamics_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.dynamics.DeterministicDynamics] = <nnabla_rl.algorithms.mppi.DefaultDynamicsBuilder object>, dynamics_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.mppi.DefaultSolverBuilder object>, replay_buffer_builder: ~nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.mppi.DefaultReplayBufferBuilder object>)[source]¶

Bases: Algorithm

MPPI (Model Predictive Path Integral) algorithm. This class implements the model predictive path integral (MPPI) algorithm proposed by G. Williams, et al. in the paper: “Information Theoretic MPC for Model-Based Reinforcement Learning” For details see: https://homes.cs.washington.edu/~bboots/files/InformationTheoreticMPC.pdf.

Our implementation of MPPI assumes that the environment’s state consists of elements in the following order: \((x_1, x_2, \cdots, x_n, \frac{dx_1}{dt}, \frac{dx_2}{dt}, \cdots, \frac{dx_n}{dt})\). For example, if you have two variables \(x\) and \(\theta\), then the state should be \((x, \theta, \dot{x}, \dot{\theta})\) and not \((x, \dot{x}, \theta, \dot{\theta})\).
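
If an environment returns its state in the interleaved order, a small wrapper can rearrange it before it reaches the algorithm. The helper below is a minimal sketch on plain Python sequences; the function name to_mppi_order is hypothetical, and it assumes the state holds exactly one derivative directly after each variable.

```python
def to_mppi_order(interleaved_state):
    """Reorder (x1, dx1, x2, dx2, ...) into (x1, x2, ..., dx1, dx2, ...)."""
    values = list(interleaved_state)
    if len(values) % 2 != 0:
        raise ValueError("expected one derivative per variable")
    positions = values[0::2]   # x1, x2, ...
    velocities = values[1::2]  # dx1/dt, dx2/dt, ...
    return positions + velocities


print(to_mppi_order(["x", "x_dot", "theta", "theta_dot"]))
```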

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
cost_function (CostFunction) – cost function to optimize the trajectory
known_dynamics (Dynamics) – Dynamics model of the target system to control. If this argument is not None, the algorithm will use the given dynamics model to compute the control input when compute_eval_action and compute_trajectory are called. This argument is optional. Defaults to None.
state_normalizer (Optional[Callable[[np.ndarray], np.ndarray]]) –
Optional. The state normalizing function is used to normalize predicted state values so that they fit in a proper range. For example, you can provide a state normalizer that keeps \(\theta\) in \(-\pi\leq\theta\leq\pi\).

Default is None.
config (MPPIConfig) – the parameter for MPPI controller
dynamics_builder (ModelBuilder[DeterministicDynamics]) – builder of deterministic dynamics models
dynamics_solver_builder (SolverBuilder) – builder of dynamics solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer. If you have bootstrap data, override the default builder and return a replay buffer with bootstrap data.
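As an illustration of the state_normalizer argument, the sketch below wraps an angle back into range. It is only an assumption about how such a function might look: the names wrap_angle and normalize_state and the angle_index argument are hypothetical, the example works on plain lists (a real state_normalizer receives and returns np.ndarray), and the modulo-based wrap yields the half-open interval [-pi, pi) rather than the closed one.

```python
import math
from typing import List, Sequence


def wrap_angle(theta: float) -> float:
    """Map an angle in radians onto the half-open interval [-pi, pi)."""
    return (theta + math.pi) % (2.0 * math.pi) - math.pi


def normalize_state(state: Sequence[float], angle_index: int = 1) -> List[float]:
    """Return a copy of `state` with the element at `angle_index` wrapped."""
    normalized = list(state)
    normalized[angle_index] = wrap_angle(normalized[angle_index])
    return normalized


# (x, theta, x_dot, theta_dot) with theta three quarters of a turn past zero
print(normalize_state([0.0, 1.5 * math.pi, 0.5, 0.0]))
```

Only the angle element is modified; positions and velocities pass through unchanged.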

compute_eval_action(**kwargs)¶

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters:

state (np.ndarray) – state to compute the action.
begin_of_episode (bool) – Used for rnn state resetting. This flag signals the beginning of an episode.
extra_info (Dict[str, Any]) – Dictionary to provide extra information to compute the action. Most algorithms do not use this field.

Returns:

Action for given state using current trained policy.

Return type:

np.ndarray

compute_trajectory(**kwargs)¶

Compute trajectory (sequence of state and action tuples) from given initial trajectory using current policy. Most reinforcement learning algorithms do not implement this method; only the optimal control algorithms implement it.

Parameters:: initial_trajectory (Sequence[Tuple[np.ndarray, Optional[np.ndarray]]]) – initial trajectory.
Returns:: Sequence of state and action tuples and extra information (if any) at each timestep, computed with current best policy. Extra information depends on the algorithm. The sequence length is the same as the length of the initial trajectory.
Return type:: Tuple[Sequence[Tuple[np.ndarray, Optional[np.ndarray]]], Sequence[Dict[str, Any]]]

classmethod is_rnn_supported()[source]¶

Check whether the algorithm supports rnn models or not.

Returns:: True if the algorithm supports rnn models. Otherwise False.
Return type:: bool

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

property latest_iteration_state¶

Return the latest iteration state, which is composed of items of the training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns:: Dictionary with items of training process state.
Return type:: Dict[str, Any]

Munchausen DQN¶

class nnabla_rl.algorithms.munchausen_dqn.MunchausenDQNConfig(gpu_id: int = -1, gamma: float = 0.99, learning_rate: float = 5e-05, batch_size: int = 32, num_steps: int = 1, learner_update_frequency: float = 4, target_update_frequency: float = 10000, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.01, test_epsilon: float = 0.001, grad_clip: Tuple[float, float] | None = (-1.0, 1.0), unroll_steps: int = 1, burn_in_steps: int = 0, reset_rnn_on_terminal: bool = True, entropy_temperature: float = 0.03, munchausen_scaling_term: float = 0.9, clipping_value: float = -1)[source]¶

Bases: DQNConfig

List of configurations for Munchausen DQN algorithm.

Parameters:

learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.00005.
final_epsilon (float) – the last epsilon value for ε-greedy explorer. Defaults to 0.01.
test_epsilon (float) – the epsilon value on testing. Defaults to 0.001.
entropy_temperature (float) – temperature parameter of softmax policy distribution. Defaults to 0.03.
munchausen_scaling_term (float) – scalar of scaled log policy. Defaults to 0.9.
clipping_value (float) – Lower bound used to clip the logarithm of the policy distribution. Defaults to -1.
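To show how the three Munchausen parameters interact, here is a minimal sketch of the reward bonus as formulated in the Munchausen Reinforcement Learning paper: the log-probability of a softmax policy over Q-values is scaled by the temperature, clipped to [clipping_value, 0], and multiplied by the scaling term. The function name munchausen_bonus is hypothetical, and this is not claimed to mirror nnabla_rl's internal computation.

```python
import math
from typing import Sequence


def munchausen_bonus(
    q_values: Sequence[float],
    action: int,
    entropy_temperature: float = 0.03,
    munchausen_scaling_term: float = 0.9,
    clipping_value: float = -1.0,
) -> float:
    """Return alpha * clip(tau * log pi(action | state), clipping_value, 0)."""
    tau = entropy_temperature
    # Softmax policy pi = softmax(q / tau), computed with the max subtracted
    # for numerical stability.
    q_max = max(q_values)
    log_z = math.log(sum(math.exp((q - q_max) / tau) for q in q_values))
    log_pi = (q_values[action] - q_max) / tau - log_z
    # Scaled log-policy, clipped from below by clipping_value and above by 0.
    clipped = min(max(tau * log_pi, clipping_value), 0.0)
    return munchausen_scaling_term * clipped


print(munchausen_bonus([1.0, 1.0], action=0))   # small penalty, uniform policy
print(munchausen_bonus([0.0, 10.0], action=0))  # penalty saturates at the clip
```

The bonus is never positive: it is close to zero for actions the policy strongly prefers and bounded below by munchausen_scaling_term × clipping_value for unlikely ones.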

class nnabla_rl.algorithms.munchausen_dqn.MunchausenDQN(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.munchausen_dqn.MunchausenDQNConfig = MunchausenDQNConfig(gpu_id=-1, gamma=0.99, learning_rate=5e-05, batch_size=32, num_steps=1, learner_update_frequency=4, target_update_frequency=10000, start_timesteps=50000, replay_buffer_size=1000000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.01, test_epsilon=0.001, grad_clip=(-1.0, 1.0), unroll_steps=1, burn_in_steps=0, reset_rnn_on_terminal=True, entropy_temperature=0.03, munchausen_scaling_term=0.9, clipping_value=-1), q_func_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.dqn.DefaultQFunctionBuilder object>, q_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.munchausen_dqn.DefaultQSolverBuilder object>, replay_buffer_builder: ~nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.dqn.DefaultReplayBufferBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.dqn.DefaultExplorerBuilder object>)[source]¶

Bases: DQN

Munchausen-DQN algorithm.

This class implements the Munchausen-DQN (Munchausen Deep Q Network) algorithm proposed by N. Vieillard, et al. in the paper: “Munchausen Reinforcement Learning”. For details see: https://proceedings.neurips.cc/paper/2020/file/2c6a0bae0f071cbbf0bb3d5b11d90a82-Paper.pdf

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (MunchausenDQNConfig) – configuration of MunchausenDQN algorithm
q_func_builder (ModelBuilder[QFunction]) – builder of q-function models
q_solver_builder (SolverBuilder) – builder for q-function solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
explorer_builder (ExplorerBuilder) – builder of environment explorer

Munchausen IQN¶

class nnabla_rl.algorithms.munchausen_iqn.MunchausenIQNConfig(gpu_id: int = -1, gamma: float = 0.99, learning_rate: float = 5e-05, batch_size: int = 32, num_steps: int = 1, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, learner_update_frequency: int = 4, target_update_frequency: int = 10000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.01, test_epsilon: float = 0.001, N: int = 64, N_prime: int = 64, K: int = 32, kappa: float = 1.0, embedding_dim: int = 64, unroll_steps: int = 1, burn_in_steps: int = 0, reset_rnn_on_terminal: bool = True, entropy_temperature: float = 0.03, munchausen_scaling_term: float = 0.9, clipping_value: float = -1)[source]¶

Bases: IQNConfig

List of configurations for Munchausen IQN algorithm.

Parameters:

entropy_temperature (float) – temperature parameter of softmax policy distribution. Defaults to 0.03.
munchausen_scaling_term (float) – scalar of scaled log policy. Defaults to 0.9.
clipping_value (float) – Lower bound used to clip the logarithm of the policy distribution. Defaults to -1.

class nnabla_rl.algorithms.munchausen_iqn.MunchausenIQN(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.munchausen_iqn.MunchausenIQNConfig = MunchausenIQNConfig(gpu_id=-1, gamma=0.99, learning_rate=5e-05, batch_size=32, num_steps=1, start_timesteps=50000, replay_buffer_size=1000000, learner_update_frequency=4, target_update_frequency=10000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.01, test_epsilon=0.001, N=64, N_prime=64, K=32, kappa=1.0, embedding_dim=64, unroll_steps=1, burn_in_steps=0, reset_rnn_on_terminal=True, entropy_temperature=0.03, munchausen_scaling_term=0.9, clipping_value=-1), risk_measure_function=<function risk_neutral_measure>, quantile_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.distributional_function.StateActionQuantileFunction] = <nnabla_rl.algorithms.iqn.DefaultQuantileFunctionBuilder object>, quantile_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.iqn.DefaultSolverBuilder object>, replay_buffer_builder: ~nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.iqn.DefaultReplayBufferBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.iqn.DefaultExplorerBuilder object>)[source]¶

Bases: IQN

Munchausen-IQN algorithm implementation.

This class implements the Munchausen-IQN (Munchausen Implicit Quantile Network) algorithm proposed by N. Vieillard, et al. in the paper: “Munchausen Reinforcement Learning”. For details see: https://proceedings.neurips.cc/paper/2020/file/2c6a0bae0f071cbbf0bb3d5b11d90a82-Paper.pdf

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (MunchausenIQNConfig) – configuration of MunchausenIQN algorithm
risk_measure_function (Callable[[nn.Variable], nn.Variable]) – risk measure function to apply to the quantiles.
quantile_function_builder (ModelBuilder[StateActionQuantileFunction]) – builder of state-action quantile function models
quantile_solver_builder (SolverBuilder) – builder for state action quantile function solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
explorer_builder (ExplorerBuilder) – builder of environment explorer

Option Critic Architecture¶

class nnabla_rl.algorithms.option_critic.OptionCriticConfig(gpu_id: int = -1, gamma: float = 0.99, intra_policy_learning_rate: float = 0.00025, termination_function_learning_rate: float = 0.00025, option_v_function_learning_rate: float = 0.00025, option_v_batch_size: int = 32, termination_function_batch_size: int = 1, intra_policy_batch_size: int = 1, learner_update_frequency: float = 4, target_update_frequency: float = 10000, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, max_option_explore_steps: int = 1000000, initial_option_epsilon: float = 1.0, final_option_epsilon: float = 0.1, test_option_epsilon: float = 0.05, advantage_offset: float = 0.01, entropy_regularizer_coefficient: float = 0.01, use_baseline: bool = True, num_options: int = 8, option_v_loss_reduction_method: str = 'sum', intra_policy_loss_reduction_method: str = 'mean', termination_function_loss_reduction_method: str = 'mean', deterministic_termination_in_eval: bool = False, deterministic_intra_action_in_eval: bool = False)[source]¶

Bases: AlgorithmConfig

List of configurations for Option Critic Architecture algorithm.

Parameters:

gamma (float) – discount factor of rewards. Defaults to 0.99.
intra_policy_learning_rate (float) – learning rate which is set to the intra policy solver. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.00025.
termination_function_learning_rate (float) – learning rate which is set to the termination function solver. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.00025.
option_v_function_learning_rate (float) – learning rate which is set to the option value function solver. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.00025.
option_v_batch_size (int) – training batch size of option value function. Defaults to 32.
termination_function_batch_size (int) – training batch size of termination function. Defaults to 1.
intra_policy_batch_size (int) – training batch size of intra policy. Defaults to 1.
learner_update_frequency (int) – the interval of learner update. Defaults to 4.
target_update_frequency (int) – the interval of target q-function update. Defaults to 10000.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 50000.
replay_buffer_size (int) – the capacity of replay buffer. Defaults to 1000000.
max_option_explore_steps (int) – the number of steps decaying the epsilon value. The epsilon will be decayed linearly \(\epsilon=\epsilon_{init} - step\times\frac{\epsilon_{init} - \epsilon_{final}}{max\_option\_explore\_steps}\). Defaults to 1000000.
initial_option_epsilon (float) – the initial epsilon value for ε-greedy explorer. Defaults to 1.0.
final_option_epsilon (float) – the last epsilon value for ε-greedy explorer. Defaults to 0.1.
test_option_epsilon (float) – the epsilon value on testing. Defaults to 0.05.
advantage_offset (float) – advantage offset value for termination function learning. Defaults to 0.01.
entropy_regularizer_coefficient (float) – scalar of entropy regularization term of intra policy learning. Defaults to 0.01.
use_baseline (bool) – If True, subtract the baseline value from the q value in intra policy learning. Defaults to True.
num_options (int) – number of options. Defaults to 8.
option_v_loss_reduction_method (str) – The reduction method for option v function loss. Defaults to ‘sum’.
intra_policy_loss_reduction_method (str) – The reduction method for intra policy loss. Defaults to ‘mean’.
termination_function_loss_reduction_method (str) – The reduction method for termination function loss. Defaults to ‘mean’.
deterministic_termination_in_eval (bool) – If true, terminates deterministically at evaluation. Defaults to False.
deterministic_intra_action_in_eval (bool) – If true, acts deterministically at evaluation. Defaults to False.
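The linear epsilon schedule given by the formula in the parameter list can be sketched as follows. The function name linear_epsilon is hypothetical, and holding the value at the final epsilon once the exploration steps are exhausted is an assumption drawn from the description of the final epsilon as "the last epsilon value"; the defaults are those of OptionCriticConfig.

```python
def linear_epsilon(
    step: int,
    initial_epsilon: float = 1.0,
    final_epsilon: float = 0.1,
    max_explore_steps: int = 1_000_000,
) -> float:
    """Linearly decay epsilon from initial_epsilon towards final_epsilon."""
    slope = (initial_epsilon - final_epsilon) / max_explore_steps
    decayed = initial_epsilon - step * slope
    # Assumption: epsilon stays at final_epsilon after max_explore_steps.
    return max(decayed, final_epsilon)


for step in (0, 500_000, 1_000_000, 2_000_000):
    print(step, linear_epsilon(step))
```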

class nnabla_rl.algorithms.option_critic.OptionCritic(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.option_critic.OptionCriticConfig = OptionCriticConfig(gpu_id=-1, gamma=0.99, intra_policy_learning_rate=0.00025, termination_function_learning_rate=0.00025, option_v_function_learning_rate=0.00025, option_v_batch_size=32, termination_function_batch_size=1, intra_policy_batch_size=1, learner_update_frequency=4, target_update_frequency=10000, start_timesteps=50000, replay_buffer_size=1000000, max_option_explore_steps=1000000, initial_option_epsilon=1.0, final_option_epsilon=0.1, test_option_epsilon=0.05, advantage_offset=0.01, entropy_regularizer_coefficient=0.01, use_baseline=True, num_options=8, option_v_loss_reduction_method='sum', intra_policy_loss_reduction_method='mean', termination_function_loss_reduction_method='mean', deterministic_termination_in_eval=False, deterministic_intra_action_in_eval=False), option_v_func_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.option_value_function.OptionValueFunction] = <nnabla_rl.algorithms.option_critic.DefaultOptionValueFunctionBuilder object>, option_v_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.option_critic.DefaultOptionVFunctionSolverBuilder object>, intra_policy_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.intra_policy.StochasticIntraPolicy] = <nnabla_rl.algorithms.option_critic.DefaultIntraPolicyBuilder object>, intra_policy_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.option_critic.DefaultIntraPolicySolverBuilder object>, termination_func_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.termination_function.StochasticTerminationFunction] = <nnabla_rl.algorithms.option_critic.DefaultTerminationFunctionBuilder object>, termination_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.option_critic.DefaultTerminationFunctionSolverBuilder object>, replay_buffer_builder: ~nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.option_critic.DefaultReplayBufferBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.option_critic.DefaultExplorerBuilder object>)[source]¶

Bases: Algorithm

Option Critic algorithm.

This class implements the Option Critic Architecture algorithm proposed by Pierre-Luc Bacon, et al. in the paper: “The Option-Critic Architecture”. For details see: https://arxiv.org/abs/1609.05140

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (OptionCriticConfig) – configuration of Option Critic algorithm
option_v_func_builder (ModelBuilder[OptionValueFunction]) – builder of option value function model
option_v_solver_builder (SolverBuilder) – builder for option value function solver
intra_policy_builder (ModelBuilder[StochasticIntraPolicy]) – builder of intra policy function model
intra_policy_solver_builder (SolverBuilder) – builder for intra policy solver
termination_func_builder (ModelBuilder[StochasticTerminationFunction]) – builder of termination function model
termination_solver_builder (SolverBuilder) – builder for termination function solver
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
explorer_builder (ExplorerBuilder) – builder of environment explorer

compute_eval_action(**kwargs)¶

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters:

state (np.ndarray) – state to compute the action.
begin_of_episode (bool) – Used for rnn state resetting. This flag signals the beginning of an episode.
extra_info (Dict[str, Any]) – Dictionary to provide extra information to compute the action. Most algorithms do not use this field.

Returns:

Action for given state using current trained policy.

Return type:

np.ndarray

classmethod is_rnn_supported()[source]¶

Check whether the algorithm supports rnn models or not.

Returns:: True if the algorithm supports rnn models. Otherwise False.
Return type:: bool

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

property latest_iteration_state¶

Return the latest iteration state, which is composed of items of the training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns:: Dictionary with items of training process state.
Return type:: Dict[str, Any]

PPO¶

class nnabla_rl.algorithms.ppo.PPOConfig(gpu_id: int = -1, epsilon: float = 0.1, gamma: float = 0.99, learning_rate: float = 0.00025, lmb: float = 0.95, entropy_coefficient: float = 0.01, value_coefficient: float = 1.0, actor_num: int = 8, epochs: int = 3, batch_size: int = 256, actor_timesteps: int = 128, total_timesteps: int = 10000, decrease_alpha: bool = True, timelimit_as_terminal: bool = False, seed: int = 1, preprocess_state: bool = True)[source]¶

Bases: AlgorithmConfig

PPOConfig List of configurations for PPO algorithm.

Parameters:

epsilon (float) – PPO’s probability ratio clipping range. Defaults to 0.1.
gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.00025.
batch_size (int) – training batch size. Defaults to 256.
lmb (float) – scalar of lambda return’s computation in GAE. Defaults to 0.95.
entropy_coefficient (float) – scalar of entropy regularization term. Defaults to 0.01.
value_coefficient (float) – scalar of value loss. Defaults to 1.0.
actor_num (int) – Number of parallel actors. Defaults to 8.
epochs (int) – Number of epochs to perform in each training iteration. Defaults to 3.
actor_timesteps (int) – Number of timesteps to interact with the environment by the actors. Defaults to 128.
total_timesteps (int) – Total number of timesteps to interact with the environment. Defaults to 10000.
decrease_alpha (bool) – Flag to control whether to decrease the learning rate linearly during the training. Defaults to True.
timelimit_as_terminal (bool) – Treat as done if the environment reaches the timelimit. Defaults to False.
seed (int) – base seed of random number generator used by the actors. Defaults to 1.
preprocess_state (bool) – Enable preprocessing the states in the collected experiences before feeding as training batch. Defaults to True.
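To make the role of the epsilon clipping range concrete, the sketch below evaluates the clipped surrogate objective from the PPO paper for a single sample. The function name clipped_surrogate is hypothetical and the code is a scalar illustration of the paper's formula, not nnabla_rl's batched implementation.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.1) -> float:
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    return min(ratio * advantage, clipped_ratio * advantage)


# ratio = pi_new(a|s) / pi_old(a|s)
print(clipped_surrogate(1.5, 2.0))   # gain is capped once the ratio exceeds 1 + eps
print(clipped_surrogate(0.5, -1.0))  # pessimistic value is kept for negative advantage
print(clipped_surrogate(1.0, 3.0))   # unchanged policy: plain advantage
```

A smaller epsilon keeps each update closer to the policy that collected the data.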

class nnabla_rl.algorithms.ppo.PPO(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.ppo.PPOConfig = PPOConfig(gpu_id=-1, epsilon=0.1, gamma=0.99, learning_rate=0.00025, lmb=0.95, entropy_coefficient=0.01, value_coefficient=1.0, actor_num=8, epochs=3, batch_size=256, actor_timesteps=128, total_timesteps=10000, decrease_alpha=True, timelimit_as_terminal=False, seed=1, preprocess_state=True), v_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.v_function.VFunction] = <nnabla_rl.algorithms.ppo.DefaultVFunctionBuilder object>, v_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.ppo.DefaultSolverBuilder object>, policy_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.ppo.DefaultPolicyBuilder object>, policy_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.ppo.DefaultSolverBuilder object>, state_preprocessor_builder: ~nnabla_rl.builders.preprocessor_builder.PreprocessorBuilder | None = <nnabla_rl.algorithms.ppo.DefaultPreprocessorBuilder object>)[source]¶

Bases: Algorithm

Proximal Policy Optimization (PPO) algorithm implementation.

This class implements the Proximal Policy Optimization (PPO) algorithm proposed by J. Schulman, et al. in the paper: “Proximal Policy Optimization Algorithms”. For details see: https://arxiv.org/abs/1707.06347

This algorithm only supports online training.

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (PPOConfig) – configuration of PPO algorithm
v_function_builder (ModelBuilder[VFunction]) – builder of v function models
v_solver_builder (SolverBuilder) – builder for v function solvers
policy_builder (ModelBuilder[StochasticPolicy]) – builder of policy models
policy_solver_builder (SolverBuilder) – builder for policy solvers
state_preprocessor_builder (None or PreprocessorBuilder) – state preprocessor builder to preprocess the states

compute_eval_action(**kwargs)¶

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters:

state (np.ndarray) – state to compute the action.
begin_of_episode (bool) – Used for rnn state resetting. This flag signals the beginning of an episode.
extra_info (Dict[str, Any]) – Dictionary to provide extra information to compute the action. Most algorithms do not use this field.

Returns:

Action for given state using current trained policy.

Return type:

np.ndarray

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

property latest_iteration_state¶

Return the latest iteration state, which is composed of items of the training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns:: Dictionary with items of training process state.
Return type:: Dict[str, Any]

QRDQN¶

class nnabla_rl.algorithms.qrdqn.QRDQNConfig(gpu_id: int = -1, gamma: float = 0.99, learning_rate: float = 5e-05, batch_size: int = 32, num_steps: int = 1, learner_update_frequency: int = 4, target_update_frequency: int = 10000, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.01, test_epsilon: float = 0.001, num_quantiles: int = 200, kappa: float = 1.0, unroll_steps: int = 1, burn_in_steps: int = 0, reset_rnn_on_terminal: bool = True)[source]¶

Bases: AlgorithmConfig

List of configurations for QRDQN algorithm.

Parameters:

gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.00005.
batch_size (int) – training batch size. Defaults to 32.
num_steps (int) – number of steps for N-step Q targets. Defaults to 1.
learner_update_frequency (int) – the interval of learner update. Defaults to 4.
target_update_frequency (int) – the interval of target q-function update. Defaults to 10000.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 50000.
replay_buffer_size (int) – the capacity of replay buffer. Defaults to 1000000.
max_explore_steps (int) – the number of steps decaying the epsilon value. The epsilon will be decayed linearly \(\epsilon=\epsilon_{init} - step\times\frac{\epsilon_{init} - \epsilon_{final}}{max\_explore\_steps}\). Defaults to 1000000.
initial_epsilon (float) – the initial epsilon value for ε-greedy explorer. Defaults to 1.0.
final_epsilon (float) – the last epsilon value for ε-greedy explorer. Defaults to 0.01.
test_epsilon (float) – the epsilon value on testing. Defaults to 0.001.
num_quantiles (int) – Number of quantile points. Defaults to 200.
kappa (float) – threshold value of quantile huber loss. Defaults to 1.0.
unroll_steps (int) – Number of steps to unroll the training network. The network will be unrolled even if the provided model doesn’t have RNN layers. Defaults to 1.
burn_in_steps (int) – Number of burn-in steps to initialize recurrent layer states during training. This setting does not take effect if the given model is not an RNN model. Defaults to 0.
reset_rnn_on_terminal (bool) – Reset recurrent internal states to zero during training if episode ends. This flag does not take effect if given model is not an RNN model. Defaults to True.
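The kappa parameter is the threshold of the quantile Huber loss. The sketch below follows the definition in the QR-DQN paper for a single quantile level and a single temporal-difference error; the function names are hypothetical, and the library's own loss may batch, sum, or normalize the terms differently.

```python
def huber(u: float, kappa: float = 1.0) -> float:
    """Huber loss: quadratic for |u| <= kappa, linear beyond it."""
    if abs(u) <= kappa:
        return 0.5 * u * u
    return kappa * (abs(u) - 0.5 * kappa)


def quantile_huber_loss(u: float, tau: float, kappa: float = 1.0) -> float:
    """Asymmetric loss |tau - 1{u < 0}| * huber(u) for quantile level tau."""
    indicator = 1.0 if u < 0.0 else 0.0
    return abs(tau - indicator) * huber(u, kappa)


# u = target - predicted quantile value, tau = quantile level in (0, 1)
print(quantile_huber_loss(0.5, tau=0.25))   # 0.03125 (underestimate, quadratic region)
print(quantile_huber_loss(-2.0, tau=0.25))  # 1.125 (overestimate, linear region)
```

For a low quantile level such as 0.25, overestimates are weighted three times as heavily as underestimates, which pulls the prediction towards the lower part of the return distribution.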

class nnabla_rl.algorithms.qrdqn.QRDQN(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.qrdqn.QRDQNConfig = QRDQNConfig(gpu_id=-1, gamma=0.99, learning_rate=5e-05, batch_size=32, num_steps=1, learner_update_frequency=4, target_update_frequency=10000, start_timesteps=50000, replay_buffer_size=1000000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.01, test_epsilon=0.001, num_quantiles=200, kappa=1.0, unroll_steps=1, burn_in_steps=0, reset_rnn_on_terminal=True), quantile_dist_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.distributional_function.QuantileDistributionFunction] = <nnabla_rl.algorithms.qrdqn.DefaultQuantileBuilder object>, quantile_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.qrdqn.DefaultSolverBuilder object>, replay_buffer_builder: ~nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.qrdqn.DefaultReplayBufferBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.qrdqn.DefaultExplorerBuilder object>)[source]¶

Bases: Algorithm

Quantile Regression DQN algorithm.

This class implements the Quantile Regression DQN algorithm proposed by W. Dabney, et al. in the paper: “Distributional Reinforcement Learning with Quantile Regression”. For details see: https://arxiv.org/abs/1710.10044

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (QRDQNConfig) – configuration of QRDQN algorithm
quantile_dist_function_builder (ModelBuilder[QuantileDistributionFunction]) – builder of quantile distribution function models
quantile_solver_builder (SolverBuilder) – builder for quantile distribution function solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
explorer_builder (ExplorerBuilder) – builder of environment explorer

compute_eval_action(**kwargs)¶

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters:

state (np.ndarray) – state to compute the action.
begin_of_episode (bool) – Used for rnn state resetting. This flag signals the beginning of an episode.
extra_info (Dict[str, Any]) – Dictionary to provide extra information to compute the action. Most algorithms do not use this field.

Returns:

Action for given state using current trained policy.

Return type:

np.ndarray

classmethod is_rnn_supported()[source]¶

Check whether the algorithm supports rnn models or not.

Returns:: True if the algorithm supports rnn models. Otherwise False.
Return type:: bool

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

property latest_iteration_state¶

Return the latest iteration state, which is composed of items of the training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns:: Dictionary with items of training process state.
Return type:: Dict[str, Any]

QRSAC¶

class nnabla_rl.algorithms.qrsac.QRSACConfig(gpu_id: int = -1, gamma: float = 0.99, learning_rate: float = 0.00030000000000000003, batch_size: int = 256, tau: float = 0.005, environment_steps: int = 1, gradient_steps: int = 1, target_entropy: float | None = None, initial_temperature: float | None = None, fix_temperature: bool = False, start_timesteps: int = 10000, replay_buffer_size: int = 1000000, num_steps: int = 1, num_quantiles: int = 32, kappa: float = 1.0, actor_unroll_steps: int = 1, actor_burn_in_steps: int = 0, actor_reset_rnn_on_terminal: bool = True, critic_unroll_steps: int = 1, critic_burn_in_steps: int = 0, critic_reset_rnn_on_terminal: bool = True)[source]¶

Bases: AlgorithmConfig

QRSACConfig List of configurations for QRSAC algorithm.

Parameters:

gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.0003.
batch_size (int) – training batch size. Defaults to 256.
tau (float) – target network’s parameter update coefficient. Defaults to 0.005.
environment_steps (int) – Number of steps to interact with the environment on each iteration. Defaults to 1.
gradient_steps (int) – Number of parameter updates to perform on each iteration. Defaults to 1.
target_entropy (float, optional) – Target entropy value. Defaults to None.
initial_temperature (float, optional) – Initial value of temperature parameter. Defaults to None.
fix_temperature (bool) – If true the temperature parameter will not be trained. Defaults to False.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 10000.
replay_buffer_size (int) – capacity of the replay buffer. Defaults to 1000000.
num_steps (int) – number of steps for N-step Q targets. Defaults to 1.
num_quantiles (int) – Number of quantile points. Defaults to 32.
kappa (float) – threshold value of quantile huber loss. Defaults to 1.0.
actor_unroll_steps (int) – Number of steps to unroll actor’s training network. The network will be unrolled even if the provided model doesn’t have RNN layers. Defaults to 1.
actor_burn_in_steps (int) – Number of burn-in steps to initialize actor’s recurrent layer states during training. This flag does not take effect if the given model is not an RNN model. Defaults to 0.
actor_reset_rnn_on_terminal (bool) – Reset actor’s recurrent internal states to zero during training if the episode ends. This flag does not take effect if the given model is not an RNN model. Defaults to True.
critic_unroll_steps (int) – Number of steps to unroll critic’s training network. The network will be unrolled even if the provided model doesn’t have RNN layers. Defaults to 1.
critic_burn_in_steps (int) – Number of burn-in steps to initialize critic’s recurrent layer states during training. This flag does not take effect if the given model is not an RNN model. Defaults to 0.
critic_reset_rnn_on_terminal (bool) – Reset critic’s recurrent internal states to zero during training if the episode ends. This flag does not take effect if the given model is not an RNN model. Defaults to True.
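
The roles of num_quantiles and kappa can be illustrated with the quantile Huber loss as it is commonly written in the quantile regression literature. This is a minimal pure-Python sketch, not the library's implementation: the function names are hypothetical, and whether nnabla_rl computes the loss term for term in this form is an assumption. The quantile level is named quantile_level here to avoid confusion with the tau target-update coefficient above.

```python
def huber(u: float, kappa: float = 1.0) -> float:
    """Huber loss: quadratic inside [-kappa, kappa], linear outside."""
    if abs(u) <= kappa:
        return 0.5 * u * u
    return kappa * (abs(u) - 0.5 * kappa)


def quantile_huber_loss(u: float, quantile_level: float, kappa: float = 1.0) -> float:
    """Loss for a TD error u = target - prediction at the given quantile level."""
    indicator = 1.0 if u < 0.0 else 0.0
    # Asymmetric weight: over- and under-estimation are penalized differently.
    return abs(quantile_level - indicator) * huber(u, kappa) / kappa


def quantile_midpoints(num_quantiles: int) -> list:
    """Midpoints (2i + 1) / (2N) of N equally sized probability bins."""
    return [(2 * i + 1) / (2 * num_quantiles) for i in range(num_quantiles)]


levels = quantile_midpoints(32)                           # num_quantiles=32
small = quantile_huber_loss(0.5, quantile_level=0.25)     # quadratic branch
large = quantile_huber_loss(-3.0, quantile_level=0.25)    # linear branch
```

With kappa=1.0, errors larger than 1 in magnitude contribute linearly rather than quadratically, which limits the influence of outlier targets.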

class nnabla_rl.algorithms.qrsac.QRSAC(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.qrsac.QRSACConfig = QRSACConfig(gpu_id=-1, gamma=0.99, learning_rate=0.00030000000000000003, batch_size=256, tau=0.005, environment_steps=1, gradient_steps=1, target_entropy=None, initial_temperature=None, fix_temperature=False, start_timesteps=10000, replay_buffer_size=1000000, num_steps=1, num_quantiles=32, kappa=1.0, actor_unroll_steps=1, actor_burn_in_steps=0, actor_reset_rnn_on_terminal=True, critic_unroll_steps=1, critic_burn_in_steps=0, critic_reset_rnn_on_terminal=True), quantile_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.distributional_function.QuantileDistributionFunction] = <nnabla_rl.algorithms.qrsac.DefaultQuantileFunctionBuilder object>, quantile_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.qrsac.DefaultSolverBuilder object>, policy_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.qrsac.DefaultPolicyBuilder object>, policy_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.qrsac.DefaultSolverBuilder object>, temperature_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.qrsac.DefaultSolverBuilder object>, replay_buffer_builder: ~nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.qrsac.DefaultReplayBufferBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.qrsac.DefaultExplorerBuilder object>)[source]¶

Quantile Regression Soft Actor-Critic (QR-SAC) algorithm.

This class implements the Quantile Regression Soft Actor-Critic (QR-SAC) algorithm proposed by P. Wurman, et al. in the paper: “Outracing champion Gran Turismo drivers with deep reinforcement learning”. For details see: https://www.nature.com/articles/s41586-021-04357-7

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (QRSACConfig) – configuration of the QRSAC algorithm
quantile_function_builder (ModelBuilder[QuantileDistributionFunction]) – builder of state-action quantile function models
quantile_solver_builder (SolverBuilder) – builder of state-action quantile function solvers
policy_builder (ModelBuilder[StochasticPolicy]) – builder of actor models
policy_solver_builder (SolverBuilder) – builder of policy solvers
temperature_solver_builder (SolverBuilder) – builder of temperature solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
explorer_builder (ExplorerBuilder) – builder of environment explorer

QtOpt (ICRA 2018 version)¶

class nnabla_rl.algorithms.icra2018_qtopt.ICRA2018QtOpt(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.icra2018_qtopt.ICRA2018QtOptConfig = ICRA2018QtOptConfig(gpu_id=-1, gamma=0.9, learning_rate=0.001, batch_size=64, num_steps=1, learner_update_frequency=1, target_update_frequency=50, start_timesteps=50000, replay_buffer_size=1000000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.1, test_epsilon=0.0, grad_clip=(-1.0, 1.0), unroll_steps=1, burn_in_steps=0, reset_rnn_on_terminal=True, q_loss_scalar=0.5, cem_initial_mean=None, cem_initial_variance=None, cem_sample_size=64, cem_num_elites=10, cem_alpha=0.0, cem_num_iterations=3, random_sample_size=16), q_func_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.icra2018_qtopt.DefaultQFunctionBuilder object>, q_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.icra2018_qtopt.DefaultSolverBuilder object>, replay_buffer_builder: ~nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.icra2018_qtopt.DefaultReplayBufferBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.icra2018_qtopt.DefaultExplorerBuilder object>)[source]¶

Bases: DDQN

DQN algorithm for a continuous action environment.

This class implements the Deep Q-Network (DQN) algorithm for a continuous action environment, proposed by D. Quillen, et al. in the paper: ‘Deep Reinforcement Learning for Vision-Based Robotic Grasping: A Simulated Comparative Evaluation of Off-Policy Methods’. For details see: https://arxiv.org/pdf/1802.10264.pdf

This algorithm is a simple version of QtOpt, referring to https://arxiv.org/abs/1806.10293.pdf

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (ICRA2018QtOptConfig) – configuration of the ICRA2018QtOpt algorithm
q_func_builder (ModelBuilder) – builder of q function model
q_solver_builder (SolverBuilder) – builder of q function solver
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
explorer_builder (ExplorerBuilder) – builder of environment explorer
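
The default config above exposes several cem_* values (cem_sample_size, cem_num_elites, cem_num_iterations, cem_alpha), which suggests that the greedy action is searched with the cross-entropy method (CEM). The following is a minimal sketch of CEM for a one-dimensional action, not the library's implementation. The function cem_argmax and the toy Q-function are hypothetical, and the reading of alpha as a smoothing factor that blends the previous mean and variance into the new estimates is an assumption.

```python
import random
import statistics


def cem_argmax(q_value, mean, variance, sample_size=64, num_elites=10,
               num_iterations=3, alpha=0.0, rng=None):
    """Approximately maximize q_value(action) over a 1-D continuous action."""
    rng = rng or random.Random(0)
    for _ in range(num_iterations):
        std = variance ** 0.5
        # Sample candidate actions from the current Gaussian.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # Keep the candidates with the highest Q-values.
        elites = sorted(samples, key=q_value, reverse=True)[:num_elites]
        new_mean = statistics.fmean(elites)
        new_variance = statistics.pvariance(elites, new_mean)
        # Assumed meaning of alpha: 0.0 discards the previous estimate entirely.
        mean = alpha * mean + (1.0 - alpha) * new_mean
        variance = alpha * variance + (1.0 - alpha) * new_variance
    return mean


# Toy Q-function with its maximum at action = 0.3 (for illustration only).
best = cem_argmax(lambda a: -(a - 0.3) ** 2, mean=0.0, variance=1.0)
```

Each iteration refits the sampling distribution to the elite samples, so the search concentrates around high-value actions without requiring gradients of the Q-function with respect to the action.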

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

Rainbow¶

class nnabla_rl.algorithms.rainbow.RainbowConfig(gpu_id: int = -1, gamma: float = 0.99, learning_rate: float = 6.25e-05, batch_size: int = 32, num_steps: int = 3, start_timesteps: int = 20000, replay_buffer_size: int = 1000000, learner_update_frequency: int = 4, target_update_frequency: int = 8000, max_explore_steps: int = 1000000, initial_epsilon: float = 0.0, final_epsilon: float = 0.0, test_epsilon: float = 0.0, v_min: float = -10.0, v_max: float = 10.0, num_atoms: int = 51, loss_reduction_method: str = 'mean', unroll_steps: int = 1, burn_in_steps: int = 0, reset_rnn_on_terminal: bool = True, alpha: float = 0.5, beta: float = 0.4, betasteps: int = 12500000, warmup_random_steps: int = 0, no_double: bool = False)[source]¶

Bases: CategoricalDDQNConfig

RainbowConfig List of configurations for Rainbow algorithm.

Parameters:

gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.00025 / 4 (6.25e-05).
batch_size (int) – training batch size. Defaults to 32.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 20000.
replay_buffer_size (int) – the capacity of replay buffer. Defaults to 1000000.
learner_update_frequency (int) – the interval of learner update. Defaults to 4.
target_update_frequency (int) – the interval of target q-function update. Defaults to 8000.
v_min (float) – lower limit of the value used in value distribution function. Defaults to -10.0.
v_max (float) – upper limit of the value used in value distribution function. Defaults to 10.0.
num_atoms (int) – the number of bins used in value distribution function. Defaults to 51.
num_steps (int) – the number of steps to look ahead in n-step Q learning. Defaults to 3.
alpha (float) – priority exponent (written as omega in the rainbow paper) of prioritized buffer. Defaults to 0.5.
beta (float) – initial value of importance sampling exponent of prioritized buffer. Defaults to 0.4.
betasteps (int) – importance sampling exponent increase steps. After betasteps, exponent will get to 1.0. Defaults to 12500000.
warmup_random_steps (int) – number of initial steps during which the trained policy is NOT used for exploration. Actions are selected randomly during these steps. Defaults to 0.
no_double (bool) – If true, the following normal Q-learning style target is used for the categorical q value update instead of the double Q-learning target: \(r + \gamma\max_{a}{Q_{\text{target}}(s_{t+1}, a)}\). Defaults to False.
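
To make v_min, v_max, num_atoms, beta and betasteps concrete, the sketch below computes the evenly spaced atoms of the value distribution and an annealed importance sampling exponent. Both helper functions are hypothetical, and the linear shape of the annealing schedule is an assumption; the description above only states that the exponent reaches 1.0 after betasteps.

```python
def atom_support(v_min: float = -10.0, v_max: float = 10.0, num_atoms: int = 51) -> list:
    """Evenly spaced values on [v_min, v_max] that serve as the distribution's bins."""
    delta_z = (v_max - v_min) / (num_atoms - 1)
    return [v_min + i * delta_z for i in range(num_atoms)]


def annealed_beta(step: int, beta: float = 0.4, betasteps: int = 12_500_000) -> float:
    """Importance sampling exponent, increased linearly (assumed) from beta to 1.0."""
    fraction = min(step / betasteps, 1.0)
    return beta + fraction * (1.0 - beta)


atoms = atom_support()                      # 51 atoms, spaced 0.4 apart
beta_start = annealed_beta(0)               # initial exponent
beta_half = annealed_beta(6_250_000)        # halfway through the schedule
beta_end = annealed_beta(20_000_000)        # clipped at 1.0 after betasteps
```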

class nnabla_rl.algorithms.rainbow.Rainbow(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.rainbow.RainbowConfig = RainbowConfig(gpu_id=-1, gamma=0.99, learning_rate=6.25e-05, batch_size=32, num_steps=3, start_timesteps=20000, replay_buffer_size=1000000, learner_update_frequency=4, target_update_frequency=8000, max_explore_steps=1000000, initial_epsilon=0.0, final_epsilon=0.0, test_epsilon=0.0, v_min=-10.0, v_max=10.0, num_atoms=51, loss_reduction_method='mean', unroll_steps=1, burn_in_steps=0, reset_rnn_on_terminal=True, alpha=0.5, beta=0.4, betasteps=12500000, warmup_random_steps=0, no_double=False), value_distribution_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.distributional_function.ValueDistributionFunction] = <nnabla_rl.algorithms.rainbow.DefaultValueDistFunctionBuilder object>, value_distribution_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.rainbow.DefaultSolverBuilder object>, replay_buffer_builder: ~nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.rainbow.DefaultReplayBufferBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.rainbow.DefaultExplorerBuilder object>)[source]¶

Bases: CategoricalDDQN

Rainbow algorithm. This class implements the Rainbow algorithm proposed by M. Hessel, et al. in the paper: “Rainbow: Combining Improvements in Deep Reinforcement Learning”. For details see: https://arxiv.org/abs/1710.02298

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (RainbowConfig) – configuration of the Rainbow algorithm
value_distribution_builder (ModelBuilder[ValueDistributionFunction]) – builder of value distribution function models
value_distribution_solver_builder (SolverBuilder) – builder of value distribution function solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
explorer_builder (ExplorerBuilder) – builder of environment explorer

REINFORCE¶

class nnabla_rl.algorithms.reinforce.REINFORCEConfig(gpu_id: int = -1, reward_scale: float = 0.01, num_rollouts_per_train_iteration: int = 10, learning_rate: float = 0.001, clip_grad_norm: float = 1.0, fixed_ln_var: float = -2.3025850929940455)[source]¶

Bases: AlgorithmConfig

List of configurations for REINFORCE algorithm.

Parameters:

reward_scale (float) – Scale of reward. Defaults to 0.01.
num_rollouts_per_train_iteration (int) – Number of rollouts per training iteration for collecting on-policy experiences. Increasing this value makes the policy function update more precise, but the computation time of each iteration will increase. Defaults to 10.
learning_rate (float) – Learning rate which is set to the solvers of policy function. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.001.
clip_grad_norm (float) – Clip the norm of the gradient to this value. Defaults to 1.0.
fixed_ln_var (float) – Fixed log variance of the policy. This configuration is only valid when the environment is continuous. Defaults to ln(0.1) (approximately -2.3026).
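
The default fixed_ln_var shown in the signature, -2.3025850929940455, is numerically the natural logarithm of 0.1. As a small worked example (variable names are illustrative only), the log variance converts to a variance and a standard deviation as follows:

```python
import math

fixed_ln_var = math.log(0.1)            # log variance, about -2.3026
variance = math.exp(fixed_ln_var)       # variance of the Gaussian policy
std = math.exp(0.5 * fixed_ln_var)      # standard deviation = sqrt(variance)
```

A variance of 0.1 corresponds to a standard deviation of roughly 0.316, so with the default setting sampled actions typically stay within a few tenths of the policy mean.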

class nnabla_rl.algorithms.reinforce.REINFORCE(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.reinforce.REINFORCEConfig = REINFORCEConfig(gpu_id=-1, reward_scale=0.01, num_rollouts_per_train_iteration=10, learning_rate=0.001, clip_grad_norm=1.0, fixed_ln_var=-2.3025850929940455), policy_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.reinforce.DefaultPolicyBuilder object>, policy_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.reinforce.DefaultSolverBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.reinforce.DefaultExplorerBuilder object>)[source]¶

Bases: Algorithm

Episodic REINFORCE implementation.

This class implements the episodic REINFORCE algorithm proposed by Ronald J. Williams in the paper: “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning”. For details see: https://link.springer.com/content/pdf/10.1007/BF00992696.pdf

This algorithm only supports online training.

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (REINFORCEConfig) – configuration of REINFORCE algorithm
policy_builder (ModelBuilder[StochasticPolicy]) – builder of policy models
policy_solver_builder (SolverBuilder) – builder for policy function solvers
explorer_builder (ExplorerBuilder) – builder of environment explorer

compute_eval_action(**kwargs)¶

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters:

state (np.ndarray) – state to compute the action.
begin_of_episode (bool) – Used for rnn state resetting. This flag indicates the beginning of an episode.
extra_info (Dict[str, Any]) – Dictionary to provide extra information to compute the action. Most algorithms do not use this field.

Returns:

Action for given state using current trained policy.

Return type:

np.ndarray

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

property latest_iteration_state¶

Return the latest iteration state, which is composed of items describing the state of the training process. You can use this state for debugging (e.g. to plot a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns:: Dictionary with items of training process state.
Return type:: Dict[str, Any]

SAC¶

class nnabla_rl.algorithms.sac.SACConfig(gpu_id: int = -1, gamma: float = 0.99, learning_rate: float = 0.00030000000000000003, batch_size: int = 256, tau: float = 0.005, environment_steps: int = 1, gradient_steps: int = 1, target_entropy: float | None = None, initial_temperature: float | None = None, fix_temperature: bool = False, start_timesteps: int = 10000, replay_buffer_size: int = 1000000, num_steps: int = 1, actor_unroll_steps: int = 1, actor_burn_in_steps: int = 0, actor_reset_rnn_on_terminal: bool = True, critic_unroll_steps: int = 1, critic_burn_in_steps: int = 0, critic_reset_rnn_on_terminal: bool = True)[source]¶

Bases: AlgorithmConfig

SACConfig List of configurations for SAC algorithm.

Parameters:

gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.0003.
batch_size (int) – training batch size. Defaults to 256.
tau (float) – target network’s parameter update coefficient. Defaults to 0.005.
environment_steps (int) – Number of steps to interact with the environment on each iteration. Defaults to 1.
gradient_steps (int) – Number of parameter updates to perform on each iteration. Defaults to 1.
target_entropy (float, optional) – Target entropy value. Defaults to None.
initial_temperature (float, optional) – Initial value of temperature parameter. Defaults to None.
fix_temperature (bool) – If true the temperature parameter will not be trained. Defaults to False.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 10000.
replay_buffer_size (int) – capacity of the replay buffer. Defaults to 1000000.
num_steps (int) – number of steps for N-step Q targets. Defaults to 1.
actor_unroll_steps (int) – Number of steps to unroll actor’s training network. The network will be unrolled even if the provided model doesn’t have RNN layers. Defaults to 1.
actor_burn_in_steps (int) – Number of burn-in steps to initialize actor’s recurrent layer states during training. This flag does not take effect if the given model is not an RNN model. Defaults to 0.
actor_reset_rnn_on_terminal (bool) – Reset actor’s recurrent internal states to zero during training if the episode ends. This flag does not take effect if the given model is not an RNN model. Defaults to True.
critic_unroll_steps (int) – Number of steps to unroll critic’s training network. The network will be unrolled even if the provided model doesn’t have RNN layers. Defaults to 1.
critic_burn_in_steps (int) – Number of burn-in steps to initialize critic’s recurrent layer states during training. This flag does not take effect if the given model is not an RNN model. Defaults to 0.
critic_reset_rnn_on_terminal (bool) – Reset critic’s recurrent internal states to zero during training if the episode ends. This flag does not take effect if the given model is not an RNN model. Defaults to True.
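
The interaction between gamma and num_steps can be sketched with a generic N-step target: the first N rewards are discounted and summed, and a value estimate at step t+N is bootstrapped with a factor of gamma to the power N. The function below is a hypothetical helper, not part of the library. bootstrap_value stands in for whatever value estimate the algorithm uses at the final step, and handling of episode termination is omitted for brevity.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """Discounted sum of len(rewards) rewards plus a discounted bootstrap value."""
    target = bootstrap_value
    # Fold from the last reward backwards: target = r + gamma * target.
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target


# num_steps=3 with gamma=0.5: 1 + 0.5 * 1 + 0.25 * 1 + 0.125 * 10
three_step = n_step_target([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
# num_steps=1 reduces to the usual one-step target r + gamma * value.
one_step = n_step_target([1.0], bootstrap_value=10.0, gamma=0.5)
```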

class nnabla_rl.algorithms.sac.SAC(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.sac.SACConfig = SACConfig(gpu_id=-1, gamma=0.99, learning_rate=0.00030000000000000003, batch_size=256, tau=0.005, environment_steps=1, gradient_steps=1, target_entropy=None, initial_temperature=None, fix_temperature=False, start_timesteps=10000, replay_buffer_size=1000000, num_steps=1, actor_unroll_steps=1, actor_burn_in_steps=0, actor_reset_rnn_on_terminal=True, critic_unroll_steps=1, critic_burn_in_steps=0, critic_reset_rnn_on_terminal=True), q_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.sac.DefaultQFunctionBuilder object>, q_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.sac.DefaultSolverBuilder object>, policy_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.sac.DefaultPolicyBuilder object>, policy_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.sac.DefaultSolverBuilder object>, temperature_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.sac.DefaultSolverBuilder object>, replay_buffer_builder: ~nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.sac.DefaultReplayBufferBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.sac.DefaultExplorerBuilder object>)[source]¶

Bases: Algorithm

Soft Actor-Critic (SAC) algorithm implementation.

This class implements the extended version of the Soft Actor-Critic (SAC) algorithm proposed by T. Haarnoja, et al. in the paper: “Soft Actor-Critic Algorithms and Applications”. For details see: https://arxiv.org/abs/1812.05905

This algorithm differs slightly from the implementation of the Soft Actor-Critic algorithm also presented by T. Haarnoja, et al. in the following paper: https://arxiv.org/abs/1801.01290

The temperature parameter is adjusted automatically instead of providing a reward scalar as a hyperparameter.
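
For reference, the automatic temperature adjustment in “Soft Actor-Critic Algorithms and Applications” minimizes the following objective with respect to the temperature \(\alpha\), where \(\bar{\mathcal{H}}\) is the target entropy (presumably the quantity configured through target_entropy). Whether this implementation optimizes \(\alpha\) directly or a reparameterization such as \(\log\alpha\) is not stated here.

\[
J(\alpha) = \mathbb{E}_{a_t \sim \pi_t}\left[ -\alpha \log \pi_t(a_t \mid s_t) - \alpha \bar{\mathcal{H}} \right]
\]

When the policy's entropy falls below the target, minimizing this objective increases \(\alpha\), which in turn strengthens the entropy bonus in the actor and critic updates.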

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (SACConfig) – configuration of the SAC algorithm
q_function_builder (ModelBuilder[QFunction]) – builder of q function models
q_solver_builder (SolverBuilder) – builder of q function solvers
policy_builder (ModelBuilder[StochasticPolicy]) – builder of actor models
policy_solver_builder (SolverBuilder) – builder of policy solvers
temperature_solver_builder (SolverBuilder) – builder of temperature solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
explorer_builder (ExplorerBuilder) – builder of environment explorer

compute_eval_action(**kwargs)¶

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters:

state (np.ndarray) – state to compute the action.
begin_of_episode (bool) – Used for rnn state resetting. This flag indicates the beginning of an episode.
extra_info (Dict[str, Any]) – Dictionary to provide extra information to compute the action. Most algorithms do not use this field.

Returns:

Action for given state using current trained policy.

Return type:

np.ndarray

classmethod is_rnn_supported()[source]¶

Check whether the algorithm supports rnn models or not.

Returns:: True if the algorithm supports rnn models. Otherwise False.
Return type:: bool

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

property latest_iteration_state¶

Return the latest iteration state, which is composed of items describing the state of the training process. You can use this state for debugging (e.g. to plot a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns:: Dictionary with items of training process state.
Return type:: Dict[str, Any]

SAC (ICML 2018 version)¶

class nnabla_rl.algorithms.icml2018_sac.ICML2018SACConfig(gpu_id: int = -1, gamma: float = 0.99, learning_rate: float = 0.00030000000000000003, batch_size: int = 256, tau: float = 0.005, environment_steps: int = 1, gradient_steps: int = 1, reward_scalar: float = 5.0, start_timesteps: int = 10000, replay_buffer_size: int = 1000000, target_update_interval: int = 1, num_steps: int = 1, pi_unroll_steps: int = 1, pi_burn_in_steps: int = 0, pi_reset_rnn_on_terminal: bool = True, q_unroll_steps: int = 1, q_burn_in_steps: int = 0, q_reset_rnn_on_terminal: bool = True, v_unroll_steps: int = 1, v_burn_in_steps: int = 0, v_reset_rnn_on_terminal: bool = True)[source]¶

Bases: AlgorithmConfig

ICML2018SACConfig List of configurations for ICML2018SAC algorithm.

Parameters:

gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.0003.
batch_size (int) – training batch size. Defaults to 256.
tau (float) – target network’s parameter update coefficient. Defaults to 0.005.
environment_steps (int) – Number of steps to interact with the environment on each iteration. Defaults to 1.
gradient_steps (int) – Number of parameter updates to perform on each iteration. Defaults to 1.
reward_scalar (float) – Reward scaling factor. Obtained reward will be multiplied by this value. Defaults to 5.0.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 10000.
replay_buffer_size (int) – capacity of the replay buffer. Defaults to 1000000.
num_steps (int) – number of steps for N-step Q targets. Defaults to 1.
target_update_interval (int) – the interval of target v function parameter’s update. Defaults to 1.
pi_unroll_steps (int) – Number of steps to unroll policy’s training network. The network will be unrolled even if the provided model doesn’t have RNN layers. Defaults to 1.
pi_burn_in_steps (int) – Number of burn-in steps to initialize policy’s recurrent layer states during training. This flag does not take effect if the given model is not an RNN model. Defaults to 0.
pi_reset_rnn_on_terminal (bool) – Reset policy’s recurrent internal states to zero during training if the episode ends. This flag does not take effect if the given model is not an RNN model. Defaults to True.
q_unroll_steps (int) – Number of steps to unroll q-function’s training network. The network will be unrolled even if the provided model doesn’t have RNN layers. Defaults to 1.
q_burn_in_steps (int) – Number of burn-in steps to initialize q-function’s recurrent layer states during training. This flag does not take effect if the given model is not an RNN model. Defaults to 0.
q_reset_rnn_on_terminal (bool) – Reset q-function’s recurrent internal states to zero during training if the episode ends. This flag does not take effect if the given model is not an RNN model. Defaults to True.
v_unroll_steps (int) – Number of steps to unroll v-function’s training network. The network will be unrolled even if the provided model doesn’t have RNN layers. Defaults to 1.
v_burn_in_steps (int) – Number of burn-in steps to initialize v-function’s recurrent layer states during training. This flag does not take effect if the given model is not an RNN model. Defaults to 0.
v_reset_rnn_on_terminal (bool) – Reset v-function’s recurrent internal states to zero during training if the episode ends. This flag does not take effect if the given model is not an RNN model. Defaults to True.
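
The tau and target_update_interval settings can be read as a soft (Polyak) update of the target v function that is applied only every few iterations. The sketch below uses plain lists of floats in place of network parameters. Both helper functions are hypothetical, and the exact update rule and timing used by the library are assumptions based on the common convention.

```python
def soft_update(target_params, source_params, tau=0.005):
    """Move each target parameter a fraction tau toward its source parameter."""
    return [(1.0 - tau) * t + tau * s for t, s in zip(target_params, source_params)]


def maybe_update_target(iteration, target_params, source_params,
                        tau=0.005, target_update_interval=1):
    """Apply the soft update only on iterations that are multiples of the interval."""
    if iteration % target_update_interval != 0:
        return list(target_params)
    return soft_update(target_params, source_params, tau)


blended = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.5)
skipped = maybe_update_target(3, [0.0], [1.0], tau=0.5, target_update_interval=2)
applied = maybe_update_target(4, [0.0], [1.0], tau=0.5, target_update_interval=2)
```

With the default tau of 0.005, the target parameters track the trained parameters slowly, which is the usual way to keep bootstrapped targets stable.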

class nnabla_rl.algorithms.icml2018_sac.ICML2018SAC(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.icml2018_sac.ICML2018SACConfig = ICML2018SACConfig(gpu_id=-1, gamma=0.99, learning_rate=0.00030000000000000003, batch_size=256, tau=0.005, environment_steps=1, gradient_steps=1, reward_scalar=5.0, start_timesteps=10000, replay_buffer_size=1000000, target_update_interval=1, num_steps=1, pi_unroll_steps=1, pi_burn_in_steps=0, pi_reset_rnn_on_terminal=True, q_unroll_steps=1, q_burn_in_steps=0, q_reset_rnn_on_terminal=True, v_unroll_steps=1, v_burn_in_steps=0, v_reset_rnn_on_terminal=True), v_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.v_function.VFunction] = <nnabla_rl.algorithms.icml2018_sac.DefaultVFunctionBuilder object>, v_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.icml2018_sac.DefaultSolverBuilder object>, q_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.icml2018_sac.DefaultQFunctionBuilder object>, q_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.icml2018_sac.DefaultSolverBuilder object>, policy_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.icml2018_sac.DefaultPolicyBuilder object>, policy_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.icml2018_sac.DefaultSolverBuilder object>, replay_buffer_builder: ~nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.icml2018_sac.DefaultReplayBufferBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.icml2018_sac.DefaultExplorerBuilder object>)[source]¶

Bases: Algorithm

Soft Actor-Critic (SAC) algorithm.

This class implements the ICML2018 version of the Soft Actor-Critic (SAC) algorithm proposed by T. Haarnoja, et al. in the paper: “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”. For details see: https://arxiv.org/abs/1801.01290

This implementation differs slightly from the implementation of the Soft Actor-Critic algorithm also presented by T. Haarnoja, et al. in the following paper: https://arxiv.org/abs/1812.05905

You will need to scale the reward received from the environment properly to get the algorithm to work.

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (ICML2018SACConfig) – configuration of the ICML2018SAC algorithm
v_function_builder (ModelBuilder[VFunction]) – builder of v function models
v_solver_builder (SolverBuilder) – builder of v function solvers
q_function_builder (ModelBuilder[QFunction]) – builder of q function models
q_solver_builder (SolverBuilder) – builder of q function solvers
policy_builder (ModelBuilder[StochasticPolicy]) – builder of actor models
policy_solver_builder (SolverBuilder) – builder of policy solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
explorer_builder (ExplorerBuilder) – builder of environment explorer
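
Why the reward scale matters can be illustrated with a small sketch. It assumes that reward_scalar simply multiplies the environment reward (an assumption about this implementation, not something stated above). It relies on the general maximum-entropy fact that scaling the reward by a factor is equivalent, up to a constant factor, to weighting the entropy bonus by the inverse of that factor. The function names are hypothetical.

```python
import math


def scaled_reward_objective(reward, entropy, reward_scalar):
    # Per-step maximum-entropy objective: scaled reward plus entropy with unit weight.
    return reward_scalar * reward + entropy


def temperature_objective(reward, entropy, temperature):
    # Equivalent form: unscaled reward plus entropy weighted by a temperature.
    return reward + temperature * entropy


reward, entropy, reward_scalar = 1.5, 0.8, 5.0
lhs = scaled_reward_objective(reward, entropy, reward_scalar)
# Scaling the reward by 5.0 acts like using a temperature of 1 / 5.0.
rhs = reward_scalar * temperature_objective(reward, entropy, 1.0 / reward_scalar)
```

A larger reward scale therefore makes the policy more greedy with respect to the reward, and a smaller one makes it more exploratory.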

compute_eval_action(**kwargs)¶

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters:

state (np.ndarray) – state to compute the action.
begin_of_episode (bool) – Used for rnn state resetting. This flag informs the beginning of episode.
extra_info (Dict[str, Any]) – Dictionary to provide extra information to compute the action. Most algorithms do not use this field.

Returns:

Action for given state using current trained policy.

Return type:

np.ndarray
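
A minimal sketch of an evaluation loop that sets begin_of_episode on the first step of each episode is shown below. StubAlgorithm and StubEnv are hypothetical stand-ins written for this example so that it runs without nnabla_rl or gym. Only the keyword-only arguments of compute_eval_action follow the signature documented above, and plain lists are used in place of np.ndarray.

```python
class StubAlgorithm:
    """Hypothetical stand-in that mimics the compute_eval_action arguments."""

    def __init__(self):
        self.reset_count = 0

    def compute_eval_action(self, state, *, begin_of_episode=False, extra_info=None):
        if begin_of_episode:
            self.reset_count += 1  # a recurrent model would reset its state here
        return [0.0 for _ in state]


class StubEnv:
    """Hypothetical environment that terminates after a fixed number of steps."""

    def __init__(self, episode_length=3):
        self._episode_length = episode_length
        self._t = 0

    def reset(self):
        self._t = 0
        return [0.0, 0.0]

    def step(self, action):
        self._t += 1
        done = self._t >= self._episode_length
        return [float(self._t), 0.0], 1.0, done, {}


def run_evaluation(algorithm, env, num_episodes):
    returns = []
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        begin_of_episode = True  # only True for the first step of an episode
        total = 0.0
        while not done:
            action = algorithm.compute_eval_action(state, begin_of_episode=begin_of_episode)
            state, reward, done, _ = env.step(action)
            total += reward
            begin_of_episode = False
        returns.append(total)
    return returns


algorithm = StubAlgorithm()
episode_returns = run_evaluation(algorithm, StubEnv(episode_length=3), num_episodes=2)
```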

classmethod is_rnn_supported()[source]¶

Check whether the algorithm supports rnn models or not.

Returns:: True if the algorithm supports rnn models. Otherwise False.
Return type:: bool

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

property latest_iteration_state¶

Return the latest iteration state, which is composed of items of the training process state. You can use this state for debugging (e.g. plotting a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns:: Dictionary with items of training process state.
Return type:: Dict[str, Any]

SAC-D¶

class nnabla_rl.algorithms.sacd.SACDConfig(gpu_id: int = -1, gamma: float = 0.99, learning_rate: float = 0.00030000000000000003, batch_size: int = 256, tau: float = 0.005, environment_steps: int = 1, gradient_steps: int = 1, target_entropy: float | None = None, initial_temperature: float | None = None, fix_temperature: bool = False, start_timesteps: int = 10000, replay_buffer_size: int = 1000000, num_steps: int = 1, actor_unroll_steps: int = 1, actor_burn_in_steps: int = 0, actor_reset_rnn_on_terminal: bool = True, critic_unroll_steps: int = 1, critic_burn_in_steps: int = 0, critic_reset_rnn_on_terminal: bool = True, reward_dimension: int = 1)[source]¶

Bases: SACConfig

SACDConfig List of configurations for SACD algorithm.

Parameters:

gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.0003.
batch_size (int) – training batch size. Defaults to 256.
tau (float) – target network’s parameter update coefficient. Defaults to 0.005.
environment_steps (int) – Number of steps to interact with the environment on each iteration. Defaults to 1.
gradient_steps (int) – Number of parameter updates to perform on each iteration. Defaults to 1.
target_entropy (float, optional) – Target entropy value. Defaults to None.
initial_temperature (float, optional) – Initial value of temperature parameter. Defaults to None.
fix_temperature (bool) – If true the temperature parameter will not be trained. Defaults to False.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 10000.
replay_buffer_size (int) – capacity of the replay buffer. Defaults to 1000000.
num_steps (int) – number of steps for N-step Q targets. Defaults to 1.
actor_unroll_steps (int) – Number of steps to unroll actor’s training network. The network will be unrolled even though the provided model doesn’t have RNN layers. Defaults to 1.
actor_burn_in_steps (int) – Number of burn-in steps to initialize actor’s recurrent layer states during training. This flag does not take effect if given model is not an RNN model. Defaults to 0.
actor_reset_rnn_on_terminal (bool) – Reset actor’s recurrent internal states to zero during training if episode ends. This flag does not take effect if given model is not an RNN model. Defaults to True.
critic_unroll_steps (int) – Number of steps to unroll critic’s training network. The network will be unrolled even though the provided model doesn’t have RNN layers. Defaults to 1.
critic_burn_in_steps (int) – Number of burn-in steps to initialize critic’s recurrent layer states during training. This flag does not take effect if given model is not an RNN model. Defaults to 0.
critic_reset_rnn_on_terminal (bool) – Reset critic’s recurrent internal states to zero during training if episode ends. This flag does not take effect if given model is not an RNN model. Defaults to True.
reward_dimension (int) – Number of reward components to learn. Defaults to 1.
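
The role of tau can be sketched as follows, assuming the usual soft (Polyak) target update, in which each target parameter moves a fraction tau toward the corresponding trained parameter. The function name and the flat list representation of parameters are hypothetical, and the library's actual update code may differ.

```python
def soft_update(target_params, source_params, tau=0.005):
    # target <- (1 - tau) * target + tau * source, applied element-wise.
    return [(1.0 - tau) * t + tau * s for t, s in zip(target_params, source_params)]


target = [0.0, 1.0]
source = [1.0, 1.0]
updated = soft_update(target, source, tau=0.005)
```

With a small tau such as 0.005 the target network changes slowly, which is typically what keeps the Q targets stable.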

class nnabla_rl.algorithms.sacd.SACD(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.sacd.SACDConfig = SACDConfig(gpu_id=-1, gamma=0.99, learning_rate=0.00030000000000000003, batch_size=256, tau=0.005, environment_steps=1, gradient_steps=1, target_entropy=None, initial_temperature=None, fix_temperature=False, start_timesteps=10000, replay_buffer_size=1000000, num_steps=1, actor_unroll_steps=1, actor_burn_in_steps=0, actor_reset_rnn_on_terminal=True, critic_unroll_steps=1, critic_burn_in_steps=0, critic_reset_rnn_on_terminal=True, reward_dimension=1), q_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.sacd.DefaultQFunctionBuilder object>, q_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.sac.DefaultSolverBuilder object>, policy_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.sac.DefaultPolicyBuilder object>, policy_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.sac.DefaultSolverBuilder object>, temperature_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.sac.DefaultSolverBuilder object>, replay_buffer_builder: ~nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.sac.DefaultReplayBufferBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.sac.DefaultExplorerBuilder object>)[source]¶

Bases: SAC

Soft Actor-Critic Decomposition (SAC-D) algorithm implementation.

This class implements the factored version of the Soft Actor-Critic (SAC) algorithm proposed by J. MacGlashan, et al. in the paper “Value Function Decomposition for Iterative Design of Reinforcement Learning Agents”. For details see: https://arxiv.org/abs/2206.13901

This algorithm trains a factored Q-function to preserve factored reward information.

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (SACDConfig) – configuration of the SACD algorithm
q_function_builder (ModelBuilder[QFunction]) – builder of q function models
q_solver_builder (SolverBuilder) – builder of q function solvers
policy_builder (ModelBuilder[StochasticPolicy]) – builder of actor models
policy_solver_builder (SolverBuilder) – builder of policy solvers
temperature_solver_builder (SolverBuilder) – builder of temperature solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
explorer_builder (ExplorerBuilder) – builder of environment explorer
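
The idea of a factored Q-function can be sketched with one-step targets computed separately for each reward component. This simplified example omits the entropy term and the twin Q-functions of SAC, so it is not the update used by this class. It only shows that the per-component targets sum to the target of the composite reward. All names are hypothetical.

```python
def component_targets(rewards, next_q_components, gamma=0.99, non_terminal=1.0):
    # One TD target per reward component: y_i = r_i + gamma * Q_i(s', a').
    return [r + gamma * non_terminal * q for r, q in zip(rewards, next_q_components)]


rewards = [1.0, -0.5, 0.25]   # reward vector with reward_dimension = 3
next_q = [2.0, 0.5, -1.0]     # per-component Q-values at the next state
targets = component_targets(rewards, next_q)

# The sum of the component targets matches the target of the summed reward.
total_target = sum(targets)
composite_target = sum(rewards) + 0.99 * sum(next_q)
```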

SRSAC¶

class nnabla_rl.algorithms.srsac.SRSACConfig(gpu_id: int = -1, gamma: float = 0.99, learning_rate: float = 0.00030000000000000003, batch_size: int = 256, tau: float = 0.005, environment_steps: int = 1, gradient_steps: int = 1, target_entropy: float | None = None, initial_temperature: float | None = None, fix_temperature: bool = False, start_timesteps: int = 10000, replay_buffer_size: int = 1000000, num_steps: int = 1, actor_unroll_steps: int = 1, actor_burn_in_steps: int = 0, actor_reset_rnn_on_terminal: bool = True, critic_unroll_steps: int = 1, critic_burn_in_steps: int = 0, critic_reset_rnn_on_terminal: bool = True, replay_ratio: int = 1, reset_interval: int = 2560000)[source]¶

Bases: SACConfig

SRSACConfig List of configurations for SRSAC algorithm.

Parameters:

gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.0003.
batch_size (int) – training batch size. Defaults to 256.
tau (float) – target network’s parameter update coefficient. Defaults to 0.005.
environment_steps (int) – Number of steps to interact with the environment on each iteration. Defaults to 1.
gradient_steps (int) – Number of parameter updates to perform on each iteration. Defaults to 1. Keep this value at 1 and use replay_ratio to control the number of updates in SRSAC.
target_entropy (float, optional) – Target entropy value. Defaults to None.
initial_temperature (float, optional) – Initial value of temperature parameter. Defaults to None.
fix_temperature (bool) – If true the temperature parameter will not be trained. Defaults to False.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 10000.
replay_buffer_size (int) – capacity of the replay buffer. Defaults to 1000000.
num_steps (int) – number of steps for N-step Q targets. Defaults to 1.
actor_unroll_steps (int) – Number of steps to unroll actor’s training network. The network will be unrolled even though the provided model doesn’t have RNN layers. Defaults to 1.
actor_burn_in_steps (int) – Number of burn-in steps to initialize actor’s recurrent layer states during training. This flag does not take effect if given model is not an RNN model. Defaults to 0.
actor_reset_rnn_on_terminal (bool) – Reset actor’s recurrent internal states to zero during training if episode ends. This flag does not take effect if given model is not an RNN model. Defaults to True.
critic_unroll_steps (int) – Number of steps to unroll critic’s training network. The network will be unrolled even though the provided model doesn’t have RNN layers. Defaults to 1.
critic_burn_in_steps (int) – Number of burn-in steps to initialize critic’s recurrent layer states during training. This flag does not take effect if given model is not an RNN model. Defaults to 0.
critic_reset_rnn_on_terminal (bool) – Reset critic’s recurrent internal states to zero during training if episode ends. This flag does not take effect if given model is not an RNN model. Defaults to True.
replay_ratio (int) – Number of updates per environment step. Defaults to 1.
reset_interval (int) – Parameters will be reset every this number of updates. Defaults to 2560000.
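
How replay_ratio and reset_interval interact can be sketched with a simple counter. The loop below is a hypothetical simulation, not the library's training loop. It assumes that replay_ratio updates follow every environment step and that a reset is triggered whenever the running update count reaches a multiple of reset_interval.

```python
def simulate_schedule(env_steps, replay_ratio, reset_interval):
    num_updates = 0
    reset_points = []
    for _ in range(env_steps):
        # One environment interaction would happen here.
        for _ in range(replay_ratio):
            num_updates += 1  # one gradient update
            if num_updates % reset_interval == 0:
                # Models and solvers would be re-initialized here.
                reset_points.append(num_updates)
    return num_updates, reset_points


num_updates, reset_points = simulate_schedule(env_steps=10, replay_ratio=4, reset_interval=10)
```

Because reset_interval counts updates rather than environment steps, raising replay_ratio makes resets occur after fewer environment steps.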

class nnabla_rl.algorithms.srsac.SRSAC(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.srsac.SRSACConfig = SRSACConfig(gpu_id=-1, gamma=0.99, learning_rate=0.00030000000000000003, batch_size=256, tau=0.005, environment_steps=1, gradient_steps=1, target_entropy=None, initial_temperature=None, fix_temperature=False, start_timesteps=10000, replay_buffer_size=1000000, num_steps=1, actor_unroll_steps=1, actor_burn_in_steps=0, actor_reset_rnn_on_terminal=True, critic_unroll_steps=1, critic_burn_in_steps=0, critic_reset_rnn_on_terminal=True, replay_ratio=1, reset_interval=2560000), q_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.sac.DefaultQFunctionBuilder object>, q_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.sac.DefaultSolverBuilder object>, policy_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.sac.DefaultPolicyBuilder object>, policy_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.sac.DefaultSolverBuilder object>, temperature_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.sac.DefaultSolverBuilder object>, replay_buffer_builder: ~nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.sac.DefaultReplayBufferBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.sac.DefaultExplorerBuilder object>)[source]¶

Bases: SAC

Scaled-by-Resetting Soft Actor-Critic (SRSAC) algorithm implementation.

This class implements the Scaled-by-Resetting Soft Actor-Critic (SRSAC) algorithm proposed by P. D’Oro, et al. in the paper “Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier”. For details see: https://openreview.net/forum?id=OpC-9aBBVJe

This algorithm periodically resets the models and optimizers’ parameters for stable and efficient learning.

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (SRSACConfig) – configuration of the SRSAC algorithm
q_function_builder (ModelBuilder[QFunction]) – builder of q function models
q_solver_builder (SolverBuilder) – builder of q function solvers
policy_builder (ModelBuilder[StochasticPolicy]) – builder of actor models
policy_solver_builder (SolverBuilder) – builder of policy solvers
temperature_solver_builder (SolverBuilder) – builder of temperature solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
explorer_builder (ExplorerBuilder) – builder of environment explorer

SRSAC(Computationally efficient ver.)¶

class nnabla_rl.algorithms.srsac.EfficientSRSACConfig(gpu_id: int = -1, gamma: float = 0.99, learning_rate: float = 0.00030000000000000003, batch_size: int = 256, tau: float = 0.005, environment_steps: int = 1, gradient_steps: int = 1, target_entropy: float | None = None, initial_temperature: float | None = None, fix_temperature: bool = False, start_timesteps: int = 10000, replay_buffer_size: int = 1000000, num_steps: int = 1, actor_unroll_steps: int = 1, actor_burn_in_steps: int = 0, actor_reset_rnn_on_terminal: bool = False, critic_unroll_steps: int = 1, critic_burn_in_steps: int = 0, critic_reset_rnn_on_terminal: bool = False, replay_ratio: int = 1, reset_interval: int = 2560000)[source]¶

Bases: SRSACConfig

EfficientSRSACConfig List of configurations for EfficientSRSAC algorithm.

Parameters:

gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.0003.
batch_size (int) – training batch size. Defaults to 256.
tau (float) – target network’s parameter update coefficient. Defaults to 0.005.
environment_steps (int) – Number of steps to interact with the environment on each iteration. Defaults to 1.
gradient_steps (int) – Number of parameter updates to perform on each iteration. Defaults to 1. Keep this value at 1 and use replay_ratio to control the number of updates in SRSAC.
target_entropy (float, optional) – Target entropy value. Defaults to None.
initial_temperature (float, optional) – Initial value of temperature parameter. Defaults to None.
fix_temperature (bool) – If true the temperature parameter will not be trained. Defaults to False.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 10000.
replay_buffer_size (int) – capacity of the replay buffer. Defaults to 1000000.
num_steps (int) – Not supported. This configuration does not take effect in the training.
actor_unroll_steps (int) – Not supported. This configuration does not take effect in the training.
actor_burn_in_steps (int) – Not supported. This configuration does not take effect in the training.
actor_reset_rnn_on_terminal (bool) – Not supported. This configuration does not take effect in the training.
critic_unroll_steps (int) – Not supported. This configuration does not take effect in the training.
critic_burn_in_steps (int) – Not supported. This configuration does not take effect in the training.
critic_reset_rnn_on_terminal (bool) – Not supported. This configuration does not take effect in the training.
replay_ratio (int) – Number of updates per environment step. Defaults to 1.
reset_interval (int) – Parameters will be reset every this number of updates. Defaults to 2560000.

class nnabla_rl.algorithms.srsac.EfficientSRSAC(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.srsac.EfficientSRSACConfig = EfficientSRSACConfig(gpu_id=-1, gamma=0.99, learning_rate=0.00030000000000000003, batch_size=256, tau=0.005, environment_steps=1, gradient_steps=1, target_entropy=None, initial_temperature=None, fix_temperature=False, start_timesteps=10000, replay_buffer_size=1000000, num_steps=1, actor_unroll_steps=1, actor_burn_in_steps=0, actor_reset_rnn_on_terminal=False, critic_unroll_steps=1, critic_burn_in_steps=0, critic_reset_rnn_on_terminal=False, replay_ratio=1, reset_interval=2560000), q_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.sac.DefaultQFunctionBuilder object>, q_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.sac.DefaultSolverBuilder object>, policy_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.sac.DefaultPolicyBuilder object>, policy_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.sac.DefaultSolverBuilder object>, temperature_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.sac.DefaultSolverBuilder object>, replay_buffer_builder: ~nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.sac.DefaultReplayBufferBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.sac.DefaultExplorerBuilder object>)[source]¶

Bases: SRSAC

Efficient implementation of Scaled-by-Resetting Soft Actor-Critic (SRSAC) algorithm.

This class implements a computationally efficient version of the Scaled-by-Resetting Soft Actor-Critic (SRSAC) algorithm proposed by P. D’Oro, et al. in the paper “Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier”.

For details see: https://openreview.net/forum?id=OpC-9aBBVJe

This implementation does not support recurrent networks. For recurrent network support, use the SRSAC class.

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (EfficientSRSACConfig) – configuration of the EfficientSRSAC algorithm
q_function_builder (ModelBuilder[QFunction]) – builder of q function models
q_solver_builder (SolverBuilder) – builder of q function solvers
policy_builder (ModelBuilder[StochasticPolicy]) – builder of actor models
policy_solver_builder (SolverBuilder) – builder of policy solvers
temperature_solver_builder (SolverBuilder) – builder of temperature solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
explorer_builder (ExplorerBuilder) – builder of environment explorer

classmethod is_rnn_supported()[source]¶

Check whether the algorithm supports rnn models or not.

Returns:: True if the algorithm supports rnn models. Otherwise False.
Return type:: bool

property latest_iteration_state¶

Return the latest iteration state, which is composed of items of the training process state. You can use this state for debugging (e.g. plotting a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns:: Dictionary with items of training process state.
Return type:: Dict[str, Any]

TD3¶

class nnabla_rl.algorithms.td3.TD3Config(gpu_id: int = -1, gamma: float = 0.99, learning_rate: float = 0.001, batch_size: int = 100, tau: float = 0.005, start_timesteps: int = 10000, replay_buffer_size: int = 1000000, d: int = 2, exploration_noise_sigma: float = 0.1, train_action_noise_sigma: float = 0.2, train_action_noise_abs: float = 0.5, num_steps: int = 1, actor_unroll_steps: int = 1, actor_burn_in_steps: int = 0, actor_reset_rnn_on_terminal: bool = True, critic_unroll_steps: int = 1, critic_burn_in_steps: int = 0, critic_reset_rnn_on_terminal: bool = True)[source]¶

Bases: AlgorithmConfig

TD3Config List of configurations for TD3 algorithm.

Parameters:

gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.001.
batch_size (int) – training batch size. Defaults to 100.
tau (float) – target network’s parameter update coefficient. Defaults to 0.005.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 10000.
replay_buffer_size (int) – capacity of the replay buffer. Defaults to 1000000.
d (int) – Interval of the policy update. The policy will be updated every d q-function updates. Defaults to 2.
exploration_noise_sigma (float) – Standard deviation of the gaussian exploration noise. Defaults to 0.1.
train_action_noise_sigma (float) – Standard deviation of the gaussian action noise used in the training. Defaults to 0.2.
train_action_noise_abs (float) – Absolute limit value of action noise used in the training. Defaults to 0.5.
num_steps (int) – number of steps for N-step Q targets. Defaults to 1.
actor_unroll_steps (int) – Number of steps to unroll actor’s training network. The network will be unrolled even though the provided model doesn’t have RNN layers. Defaults to 1.
actor_burn_in_steps (int) – Number of burn-in steps to initialize actor’s recurrent layer states during training. This flag does not take effect if given model is not an RNN model. Defaults to 0.
actor_reset_rnn_on_terminal (bool) – Reset actor’s recurrent internal states to zero during training if episode ends. This flag does not take effect if given model is not an RNN model. Defaults to True.
critic_unroll_steps (int) – Number of steps to unroll critic’s training network. The network will be unrolled even though the provided model doesn’t have RNN layers. Defaults to 1.
critic_burn_in_steps (int) – Number of burn-in steps to initialize critic’s recurrent layer states during training. This flag does not take effect if given model is not an RNN model. Defaults to 0.
critic_reset_rnn_on_terminal (bool) – Reset critic’s recurrent internal states to zero during training if episode ends. This flag does not take effect if given model is not an RNN model. Defaults to True.
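
Two of the TD3-specific settings above can be sketched in a few lines: the clipping of the training action noise (train_action_noise_sigma and train_action_noise_abs) and the delayed policy update controlled by d. This is a simplified scalar illustration with hypothetical function names, not the library's implementation.

```python
import random


def clipped_noise(sigma, noise_abs, rng):
    # Gaussian noise with standard deviation sigma, clipped to [-noise_abs, noise_abs].
    noise = rng.gauss(0.0, sigma)
    return max(-noise_abs, min(noise_abs, noise))


def policy_update_steps(num_q_updates, d):
    # The policy is updated once every d Q-function updates.
    return [step for step in range(1, num_q_updates + 1) if step % d == 0]


rng = random.Random(0)
noises = [clipped_noise(sigma=0.2, noise_abs=0.5, rng=rng) for _ in range(1000)]
steps = policy_update_steps(num_q_updates=6, d=2)
```

In this sketch, with d=2 the policy is updated on Q-function updates 2, 4 and 6.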

class nnabla_rl.algorithms.td3.TD3(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.td3.TD3Config = TD3Config(gpu_id=-1, gamma=0.99, learning_rate=0.001, batch_size=100, tau=0.005, start_timesteps=10000, replay_buffer_size=1000000, d=2, exploration_noise_sigma=0.1, train_action_noise_sigma=0.2, train_action_noise_abs=0.5, num_steps=1, actor_unroll_steps=1, actor_burn_in_steps=0, actor_reset_rnn_on_terminal=True, critic_unroll_steps=1, critic_burn_in_steps=0, critic_reset_rnn_on_terminal=True), critic_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.td3.DefaultCriticBuilder object>, critic_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.td3.DefaultSolverBuilder object>, actor_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.policy.DeterministicPolicy] = <nnabla_rl.algorithms.td3.DefaultActorBuilder object>, actor_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.td3.DefaultSolverBuilder object>, replay_buffer_builder: ~nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.td3.DefaultReplayBufferBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.td3.DefaultExplorerBuilder object>)[source]¶

Bases: Algorithm

Twin Delayed Deep Deterministic policy gradient (TD3) algorithm.

This class implements the Twin Delayed Deep Deterministic policy gradient (TD3) algorithm proposed by S. Fujimoto, et al. in the paper “Addressing Function Approximation Error in Actor-Critic Methods”. For details see: https://arxiv.org/abs/1802.09477

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (TD3Config) – configuration of the TD3 algorithm
critic_builder (ModelBuilder[QFunction]) – builder of critic models
critic_solver_builder (SolverBuilder) – builder of critic solvers
actor_builder (ModelBuilder[DeterministicPolicy]) – builder of actor models
actor_solver_builder (SolverBuilder) – builder of actor solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
explorer_builder (ExplorerBuilder) – builder of environment explorer

compute_eval_action(**kwargs)¶

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters:

state (np.ndarray) – state to compute the action.
begin_of_episode (bool) – Used for rnn state resetting. This flag informs the beginning of episode.
extra_info (Dict[str, Any]) – Dictionary to provide extra information to compute the action. Most algorithms do not use this field.

Returns:

Action for given state using current trained policy.

Return type:

np.ndarray

classmethod is_rnn_supported()[source]¶

Check whether the algorithm supports rnn models or not.

Returns:: True if the algorithm supports rnn models. Otherwise False.
Return type:: bool

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

property latest_iteration_state¶

Return the latest iteration state, which is composed of items of the training process state. You can use this state for debugging (e.g. plotting a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns:: Dictionary with items of training process state.
Return type:: Dict[str, Any]

TRPO¶

class nnabla_rl.algorithms.trpo.TRPOConfig(gpu_id: int = -1, gamma: float = 0.995, lmb: float = 0.97, num_steps_per_iteration: int = 5000, pi_batch_size: int = 5000, sigma_kl_divergence_constraint: float = 0.01, maximum_backtrack_numbers: int = 10, conjugate_gradient_damping: float = 0.1, conjugate_gradient_iterations: int = 20, vf_epochs: int = 5, vf_batch_size: int = 64, vf_learning_rate: float = 0.001, preprocess_state: bool = True, gpu_batch_size: int | None = None)[source]¶

Bases: AlgorithmConfig

List of configurations for TRPO algorithm.

Parameters:

gamma (float) – Discount factor of rewards. Defaults to 0.995.
lmb (float) – Scalar of lambda return’s computation in GAE. Defaults to 0.97. This configuration is related to bias and variance of estimated value. If it is close to 0, estimated value is low-variance but biased. If it is close to 1, estimated value is unbiased but high-variance.
num_steps_per_iteration (int) – Number of steps per each training iteration for collecting on-policy experiences. Increasing this value makes the policy and value function updates more precise, but the computation time of each iteration will increase. Defaults to 5000.
pi_batch_size (int) – Training batch size of policy. Usually, pi_batch_size is the same as num_steps_per_iteration. Defaults to 5000.
sigma_kl_divergence_constraint (float) – Constraint size of kl divergence between previous policy and updated policy. Defaults to 0.01.
maximum_backtrack_numbers (int) – Maximum backtrack numbers of linesearch. Defaults to 10.
conjugate_gradient_damping (float) – Damping size of conjugate gradient method. Defaults to 0.1.
conjugate_gradient_iterations (int) – Number of iterations of conjugate gradient method. Defaults to 20.
vf_epochs (int) – Number of epochs in each iteration. Defaults to 5.
vf_batch_size (int) – Training batch size of value function. Defaults to 64.
vf_learning_rate (float) – Learning rate which is set to the solvers of value function. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.001.
preprocess_state (bool) – Enable preprocessing the states in the collected experiences before feeding as training batch. Defaults to True.
gpu_batch_size (int, optional) – Actual batch size to reduce one forward gpu calculation memory. As long as gpu memory size is enough, this configuration should not be specified. If not specified, gpu_batch_size is the same as pi_batch_size. Defaults to None.
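
The bias and variance trade-off controlled by lmb can be seen in a minimal sketch of Generalized Advantage Estimation. The function below is a hypothetical pure-Python illustration for a single trajectory without terminal states inside it. It is not the library's implementation.

```python
def compute_gae(rewards, values, last_value, gamma=0.995, lmb=0.97):
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    # Work backwards so each step can reuse the advantage of the following step.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        gae = delta + gamma * lmb * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
# lmb = 0 reduces to the one-step TD error (low variance, biased).
adv_td = compute_gae(rewards, values, last_value=0.0, gamma=0.9, lmb=0.0)
# lmb = 1 reduces to the discounted return minus the value (unbiased, high variance).
adv_mc = compute_gae(rewards, values, last_value=0.0, gamma=0.9, lmb=1.0)
```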

class nnabla_rl.algorithms.trpo.TRPO(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.trpo.TRPOConfig = TRPOConfig(gpu_id=-1, gamma=0.995, lmb=0.97, num_steps_per_iteration=5000, pi_batch_size=5000, sigma_kl_divergence_constraint=0.01, maximum_backtrack_numbers=10, conjugate_gradient_damping=0.1, conjugate_gradient_iterations=20, vf_epochs=5, vf_batch_size=64, vf_learning_rate=0.001, preprocess_state=True, gpu_batch_size=None), v_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.v_function.VFunction] = <nnabla_rl.algorithms.trpo.DefaultVFunctionBuilder object>, v_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.trpo.DefaultSolverBuilder object>, policy_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.trpo.DefaultPolicyBuilder object>, state_preprocessor_builder: ~nnabla_rl.builders.preprocessor_builder.PreprocessorBuilder | None = <nnabla_rl.algorithms.trpo.DefaultPreprocessorBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.trpo.DefaultExplorerBuilder object>)[source]¶

Bases: Algorithm

Trust Region Policy Optimization method with Generalized Advantage Estimation (GAE) implementation.

This class implements the Trust Region Policy Optimization (TRPO) algorithm with Generalized Advantage Estimation (GAE), proposed by J. Schulman, et al. in the papers “Trust Region Policy Optimization” and “High-Dimensional Continuous Control Using Generalized Advantage Estimation”. For details see: https://arxiv.org/abs/1502.05477 and https://arxiv.org/abs/1506.02438

This algorithm only supports online training.

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (TRPOConfig) – configuration of the TRPO algorithm
v_function_builder (ModelBuilder[VFunction]) – builder of v function models
v_solver_builder (SolverBuilder) – builder for v function solvers
policy_builder (ModelBuilder[StochasticPolicy]) – builder of policy models
state_preprocessor_builder (None or PreprocessorBuilder) – state preprocessor builder to preprocess the states
explorer_builder (ExplorerBuilder) – builder of environment explorer

compute_eval_action(**kwargs)¶

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters:

state (np.ndarray) – state to compute the action.
begin_of_episode (bool) – Used for rnn state resetting. This flag indicates the beginning of an episode.
extra_info (Dict[str, Any]) – Dictionary to provide extra information to compute the action. Most algorithms do not use this field.

Returns:

Action for given state using current trained policy.

Return type:

np.ndarray
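The sketch below shows how begin_of_episode might be passed in an evaluation loop. Both classes are hypothetical stand-ins written for this example: ConstantPolicyAlgorithm only imitates the documented compute_eval_action signature, and CountdownEnv is a toy environment with a classic gym-style reset/step interface. Plain floats are used as states to keep the example dependency-free, whereas the real method expects np.ndarray.

```python
from typing import List


class ConstantPolicyAlgorithm:
    """Hypothetical stand-in that imitates the documented method signature."""

    def __init__(self) -> None:
        self.episode_starts = 0

    def compute_eval_action(self, state, *, begin_of_episode=False, extra_info=None):
        if begin_of_episode:
            # A real algorithm would reset its rnn states at this point.
            self.episode_starts += 1
        return 0


class CountdownEnv:
    """Hypothetical toy environment whose episodes end after `horizon` steps."""

    def __init__(self, horizon: int = 3) -> None:
        self._horizon = horizon
        self._t = 0

    def reset(self) -> float:
        self._t = 0
        return float(self._t)

    def step(self, action):
        self._t += 1
        done = self._t >= self._horizon
        return float(self._t), 1.0, done, {}


def evaluate(algorithm, env, num_episodes: int) -> List[float]:
    """Run evaluation episodes and return the undiscounted return of each."""
    returns = []
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        begin_of_episode = True  # only the first call of an episode sets the flag
        total = 0.0
        while not done:
            action = algorithm.compute_eval_action(state, begin_of_episode=begin_of_episode)
            state, reward, done, _ = env.step(action)
            total += reward
            begin_of_episode = False
        returns.append(total)
    return returns
```

For a policy without recurrent layers the flag has no effect, so passing it unconditionally keeps the loop valid for both kinds of model.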

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

property latest_iteration_state¶

Return the latest iteration state, which is composed of items of the training process state. You can use this state for debugging (e.g. plotting a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns:: Dictionary with items of training process state.
Return type:: Dict[str, Any]

TRPO (ICML 2015 version)¶

class nnabla_rl.algorithms.icml2015_trpo.ICML2015TRPOConfig(gpu_id: int = -1, gamma: float = 0.99, num_steps_per_iteration: int = 100000, batch_size: int = 100000, gpu_batch_size: int | None = None, sigma_kl_divergence_constraint: float = 0.01, maximum_backtrack_numbers: int = 10, conjugate_gradient_damping: float = 0.001, conjugate_gradient_iterations: int = 10)[source]¶

Bases: AlgorithmConfig

List of configurations for ICML2015TRPO algorithm.

Parameters:

gamma (float) – Discount factor of rewards. Defaults to 0.99.
num_steps_per_iteration (int) – Number of environment steps per training iteration for collecting on-policy experiences. Increasing this value makes the parameter updates more precise, but the computation time of each iteration will increase. Defaults to 100000.
batch_size (int) – Training batch size of policy. Usually, batch_size is the same as num_steps_per_iteration. Defaults to 100000.
gpu_batch_size (int, optional) – Actual batch size used for a single forward computation, to reduce gpu memory usage. As long as the gpu memory size is sufficient, this configuration should not be specified. If not specified, gpu_batch_size is the same as batch_size. Defaults to None.
sigma_kl_divergence_constraint (float) – Constraint size of kl divergence between previous policy and updated policy. Defaults to 0.01.
maximum_backtrack_numbers (int) – Maximum backtrack numbers of linesearch. Defaults to 10.
conjugate_gradient_damping (float) – Damping size of conjugate gradient method. Defaults to 0.001.
conjugate_gradient_iterations (int) – Number of iterations of conjugate gradient method. Defaults to 10.
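To make the roles of conjugate_gradient_damping and conjugate_gradient_iterations more concrete, here is a minimal pure-Python sketch of a damped conjugate gradient solver. It is an assumption about how such parameters are commonly used in TRPO implementations, not a copy of the nnabla-rl code: the function names and the fvp callback (a function returning the product of the Fisher matrix with a vector) are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(fvp: Callable[[Vector], Vector],
                       b: Vector,
                       iterations: int = 10,
                       damping: float = 0.001,
                       tol: float = 1e-10) -> Vector:
    """Approximately solve (F + damping * I) x = b, where fvp(v) returns F v."""

    def damped_fvp(v: Vector) -> Vector:
        # Damping adds a multiple of the identity to keep the system well conditioned.
        return [fv + damping * vi for fv, vi in zip(fvp(v), v)]

    x = [0.0] * len(b)
    r = list(b)  # residual b - A x, with x = 0 initially
    p = list(b)  # search direction
    rs_old = dot(r, r)
    for _ in range(iterations):
        if rs_old < tol:
            break  # residual is already negligible
        ap = damped_fvp(p)
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```

A larger damping value makes the solve more stable at the cost of biasing the step direction, while more iterations give a more accurate solution at a higher computational cost.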

class nnabla_rl.algorithms.icml2015_trpo.ICML2015TRPO(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.icml2015_trpo.ICML2015TRPOConfig = ICML2015TRPOConfig(gpu_id=-1, gamma=0.99, num_steps_per_iteration=100000, batch_size=100000, gpu_batch_size=None, sigma_kl_divergence_constraint=0.01, maximum_backtrack_numbers=10, conjugate_gradient_damping=0.001, conjugate_gradient_iterations=10), policy_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.icml2015_trpo.DefaultPolicyBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.icml2015_trpo.DefaultExplorerBuilder object>)[source]¶

Bases: Algorithm

Trust Region Policy Optimization method with Single Path algorithm.

This class implements the Trust Region Policy Optimization (TRPO) with Single Path algorithm proposed by J. Schulman, et al. in the paper: “Trust Region Policy Optimization”. For details see: https://arxiv.org/abs/1502.05477
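In the Single Path procedure of the paper, the value of each visited state-action pair is estimated from the discounted rewards observed along the sampled trajectory. The helper below is a hypothetical sketch of that estimate, with gamma defaulting to the ICML2015TRPOConfig value of 0.99. It is not the nnabla-rl implementation.

```python
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float = 0.99) -> List[float]:
    """Return the discounted reward-to-go for every step of one trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```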

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (ICML2015TRPOConfig) – configuration of ICML2015TRPO algorithm
policy_builder (ModelBuilder[StochasticPolicy]) – builder of policy models
explorer_builder (ExplorerBuilder) – builder of environment explorer

compute_eval_action(**kwargs)¶

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters:

state (np.ndarray) – state to compute the action.
begin_of_episode (bool) – Used for rnn state resetting. This flag indicates the beginning of an episode.
extra_info (Dict[str, Any]) – Dictionary to provide extra information to compute the action. Most algorithms do not use this field.

Returns:

Action for given state using current trained policy.

Return type:

np.ndarray

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

property latest_iteration_state¶

Return the latest iteration state, which is composed of items of the training process state. You can use this state for debugging (e.g. plotting a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns:: Dictionary with items of training process state.
Return type:: Dict[str, Any]

XQL (eXtreme Q-Learning)¶

class nnabla_rl.algorithms.xql.XQLConfig(gpu_id: int = -1, gamma: float = 0.99, learning_rate: float = 0.00030000000000000003, batch_size: int = 256, tau: float = 0.005, value_temperature: float = 2.0, policy_temperature: float = 2.0, start_timesteps: int = 10000, replay_buffer_size: int = 1000000, num_steps: int = 1, pi_unroll_steps: int = 1, pi_burn_in_steps: int = 0, pi_reset_rnn_on_terminal: bool = True, q_unroll_steps: int = 1, q_burn_in_steps: int = 0, q_reset_rnn_on_terminal: bool = True, v_unroll_steps: int = 1, v_burn_in_steps: int = 0, v_reset_rnn_on_terminal: bool = True)[source]¶

Bases: AlgorithmConfig

List of configurations for XQL algorithm.

Parameters:

gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.0003.
batch_size (int) – training batch size. Defaults to 256.
tau (float) – target network’s parameter update coefficient. Defaults to 0.005.
value_temperature (float) – Temperature parameter (a.k.a. beta) used to balance between the reward and the KL divergence. This parameter is used in the training of the value function. Theoretically, the value temperature and the policy temperature should be the same, but they are provided as two separate values for engineering purposes. Defaults to 2.0.
policy_temperature (float) – Temperature parameter (a.k.a. beta) used to balance between the reward and the KL divergence. This parameter is used in the training of the policy function. Theoretically, the value temperature and the policy temperature should be the same, but they are provided as two separate values for engineering purposes. Defaults to 2.0.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 10000.
replay_buffer_size (int) – capacity of the replay buffer. Defaults to 1000000.
num_steps (int) – number of steps for N-step Q targets. Defaults to 1.
pi_unroll_steps (int) – Number of steps to unroll policy’s training network. The network will be unrolled even if the provided model doesn’t have RNN layers. Defaults to 1.
pi_burn_in_steps (int) – Number of burn-in steps to initialize policy’s recurrent layer states during training. This flag does not take effect if given model is not an RNN model. Defaults to 0.
pi_reset_rnn_on_terminal (bool) – Reset policy’s recurrent internal states to zero during training if episode ends. This flag does not take effect if given model is not an RNN model. Defaults to True.
q_unroll_steps (int) – Number of steps to unroll q-function’s training network. The network will be unrolled even if the provided model doesn’t have RNN layers. Defaults to 1.
q_burn_in_steps (int) – Number of burn-in steps to initialize q-function’s recurrent layer states during training. This flag does not take effect if given model is not an RNN model. Defaults to 0.
q_reset_rnn_on_terminal (bool) – Reset q-function’s recurrent internal states to zero during training if episode ends. This flag does not take effect if given model is not an RNN model. Defaults to True.
v_unroll_steps (int) – Number of steps to unroll v-function’s training network. The network will be unrolled even if the provided model doesn’t have RNN layers. Defaults to 1.
v_burn_in_steps (int) – Number of burn-in steps to initialize v-function’s recurrent layer states during training. This flag does not take effect if given model is not an RNN model. Defaults to 0.
v_reset_rnn_on_terminal (bool) – Reset v-function’s recurrent internal states to zero during training if episode ends. This flag does not take effect if given model is not an RNN model. Defaults to True.
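The tau parameter above is described as the target network’s parameter update coefficient. A common reading of such a coefficient is the soft (Polyak) update sketched below. This is an assumption about the update rule rather than a statement of what nnabla-rl does internally. The function soft_update is hypothetical, and scalar parameters stand in for real tensors.

```python
from typing import Dict


def soft_update(target_params: Dict[str, float],
                params: Dict[str, float],
                tau: float = 0.005) -> Dict[str, float]:
    """Return target parameters moved a fraction tau towards the trained parameters."""
    return {
        name: tau * params[name] + (1.0 - tau) * target_value
        for name, target_value in target_params.items()
    }
```

Under this rule a small tau such as 0.005 makes the target network track the trained network slowly, whereas tau=1.0 would copy the parameters outright.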

class nnabla_rl.algorithms.xql.XQL(env_or_env_info: ~gym.core.Env | ~nnabla_rl.environments.environment_info.EnvironmentInfo, config: ~nnabla_rl.algorithms.xql.XQLConfig = XQLConfig(gpu_id=-1, gamma=0.99, learning_rate=0.00030000000000000003, batch_size=256, tau=0.005, value_temperature=2.0, policy_temperature=2.0, start_timesteps=10000, replay_buffer_size=1000000, num_steps=1, pi_unroll_steps=1, pi_burn_in_steps=0, pi_reset_rnn_on_terminal=True, q_unroll_steps=1, q_burn_in_steps=0, q_reset_rnn_on_terminal=True, v_unroll_steps=1, v_burn_in_steps=0, v_reset_rnn_on_terminal=True), q_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.xql.DefaultQFunctionBuilder object>, q_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.xql.DefaultSolverBuilder object>, v_function_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.v_function.VFunction] = <nnabla_rl.algorithms.xql.DefaultVFunctionBuilder object>, v_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.xql.DefaultSolverBuilder object>, policy_builder: ~nnabla_rl.builders.model_builder.ModelBuilder[~nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.xql.DefaultPolicyBuilder object>, policy_solver_builder: ~nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.xql.DefaultSolverBuilder object>, replay_buffer_builder: ~nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.xql.DefaultReplayBufferBuilder object>, explorer_builder: ~nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.xql.DefaultExplorerBuilder object>)[source]¶

Bases: Algorithm

eXtreme Q-Learning (XQL) algorithm implementation.

This class implements the eXtreme Q-Learning (XQL) algorithm proposed by D. Garg, et al. in the paper: “Extreme Q-Learning: MaxEnt RL without Entropy”. For details see: https://arxiv.org/abs/2301.02328
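The paper trains the value function with a Gumbel regression objective of the form E[exp(z) - z - 1], where z = (Q - V) / beta and beta plays the role of the value_temperature configuration. The following sketch evaluates that objective on plain Python lists. The function gumbel_loss is hypothetical and omits the clipping of z that practical implementations typically add for numerical stability.

```python
import math
from typing import Sequence


def gumbel_loss(q_values: Sequence[float],
                v_values: Sequence[float],
                temperature: float = 2.0) -> float:
    """Mean Gumbel regression loss between Q targets and V predictions."""
    total = 0.0
    for q, v in zip(q_values, v_values):
        z = (q - v) / temperature
        # exp(z) - z - 1 is zero at z = 0 and grows faster for positive z.
        total += math.exp(z) - z - 1.0
    return total / len(q_values)
```

The loss is asymmetric: underestimating Q (positive z) is penalized more strongly than overestimating it, which pushes V towards a soft maximum of Q.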

Parameters:

env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (XQLConfig) – configuration of the XQL algorithm
q_function_builder (ModelBuilder[QFunction]) – builder of q function models
q_solver_builder (SolverBuilder) – builder of q function solvers
v_function_builder (ModelBuilder[VFunction]) – builder of v function models
v_solver_builder (SolverBuilder) – builder of v function solvers
policy_builder (ModelBuilder[StochasticPolicy]) – builder of policy models
policy_solver_builder (SolverBuilder) – builder of policy solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
explorer_builder (ExplorerBuilder) – builder of environment explorer

compute_eval_action(**kwargs)¶

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters:

state (np.ndarray) – state to compute the action.
begin_of_episode (bool) – Used for rnn state resetting. This flag indicates the beginning of an episode.
extra_info (Dict[str, Any]) – Dictionary to provide extra information to compute the action. Most algorithms do not use this field.

Returns:

Action for given state using current trained policy.

Return type:

np.ndarray

classmethod is_rnn_supported()[source]¶

Check whether the algorithm supports rnn models or not.

Returns:: True if the algorithm supports rnn models. Otherwise False.
Return type:: bool

classmethod is_supported_env(env_or_env_info)[source]¶

Check whether the algorithm supports the environment or not.

Parameters:: env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
Returns:: True if the algorithm supports the environment. Otherwise False.
Return type:: bool

property latest_iteration_state¶

Return the latest iteration state, which is composed of items of the training process state. You can use this state for debugging (e.g. plotting a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns:: Dictionary with items of training process state.
Return type:: Dict[str, Any]