
All algorithms are derived from nnabla_rl.algorithm.Algorithm.


Algorithms run on the CPU by default, regardless of the nnabla context set prior to instantiation. To run an algorithm on a GPU, set gpu_id through the algorithm’s config. Note that the algorithm overrides the nnabla context when training starts.


class nnabla_rl.algorithm.AlgorithmConfig(gpu_id: int = - 1)[source]

List of configurations common to all algorithms


gpu_id (int) – id of the gpu to use. If negative, the training will run on cpu. Defaults to -1.

class nnabla_rl.algorithm.Algorithm(env_info, config=AlgorithmConfig(gpu_id=- 1))[source]

Base Algorithm class



Default functions, solvers, and configurations follow each algorithm’s original paper. The default functions may not work for some environments.

abstract compute_eval_action(state)numpy.array[source]

Compute action for given state using current best policy. This is usually used for evaluation.


state (np.ndarray) – state to compute the action.


Action for given state using current trained policy.

Return type


property iteration_num: int

Current iteration number.


Current iteration number of running training.

Return type

int

property latest_iteration_state: Dict[str, Any]

Return the latest iteration state, which is composed of items of the training process state. You can use this state for debugging (e.g. to plot a loss curve). See [IterationStateHook](./hooks/) for more details.


Dictionary with items of training process state.

Return type

Dict[str, Any]

property max_iterations: int

Maximum iteration number of running training.


Maximum iteration number of running training.

Return type

int

set_hooks(hooks: Sequence[nnabla_rl.hook.Hook])[source]

Set hooks for running additional operation during training. Previously set hooks will be removed and replaced with new hooks.


hooks (list of nnabla_rl.hook.Hook) – Hooks to invoke during training

train(env_or_buffer: Union[gym.core.Env, nnabla_rl.replay_buffer.ReplayBuffer], total_iterations: int)[source]

Train the policy with the reinforcement learning algorithm.

  • env_or_buffer (Union[gym.Env, ReplayBuffer]) – Target environment to train the policy online, or replay buffer to train the policy offline.

  • total_iterations (int) – Total number of iterations to train the policy.


UnsupportedTrainingException – Raised if this algorithm does not support the training method for the given parameter.

train_offline(replay_buffer: gym.core.Env, total_iterations: int)[source]

Train the policy using only the replay buffer.

  • replay_buffer (ReplayBuffer) – Replay buffer to sample experiences to train the policy.

  • total_iterations (int) – Total number of iterations to train the policy.


UnsupportedTrainingException – Raised if the algorithm does not support offline training.

train_online(train_env: gym.core.Env, total_iterations: int)[source]

Train the policy by interacting with given environment.

  • train_env (gym.Env) – Target environment to train the policy.

  • total_iterations (int) – Total number of iterations to train the policy.


UnsupportedTrainingException – Raised if the algorithm does not support online training.
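The relationship between train, train_online, and train_offline described above can be pictured with the following minimal sketch. It assumes that train dispatches on the type of its first argument. All classes here are hypothetical stand-ins written for illustration and are not nnabla_rl's own classes or implementation.

```python
class UnsupportedTrainingException(Exception):
    """Stand-in for the exception named in the documentation above."""


class ReplayBufferStandIn:
    """Hypothetical placeholder for nnabla_rl.replay_buffer.ReplayBuffer."""


class OnlineOnlyAlgorithm:
    """Hypothetical algorithm that supports online training only."""

    def train(self, env_or_buffer, total_iterations):
        # Assumed dispatch: a replay buffer selects offline training,
        # anything else is treated as an environment.
        if isinstance(env_or_buffer, ReplayBufferStandIn):
            return self.train_offline(env_or_buffer, total_iterations)
        return self.train_online(env_or_buffer, total_iterations)

    def train_online(self, train_env, total_iterations):
        return f"online for {total_iterations} iterations"

    def train_offline(self, replay_buffer, total_iterations):
        raise UnsupportedTrainingException("offline training is not supported")


algorithm = OnlineOnlyAlgorithm()
# object() plays the role of an environment in this sketch.
result = algorithm.train(object(), total_iterations=10)

try:
    algorithm.train(ReplayBufferStandIn(), total_iterations=10)
    offline_rejected = False
except UnsupportedTrainingException:
    offline_rejected = True
```

An online-only algorithm such as A2C would, under this assumption, reject a replay buffer in the same way.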


class nnabla_rl.algorithms.a2c.A2CConfig(gpu_id: int = - 1, gamma: float = 0.99, n_steps: int = 5, learning_rate: float = 0.0007, entropy_coefficient: float = 0.01, value_coefficient: float = 0.5, decay: float = 0.99, epsilon: float = 1e-05, start_timesteps: int = 1, actor_num: int = 8, timelimit_as_terminal: bool = False, max_grad_norm: Optional[float] = 0.5, seed: int = - 1)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for A2C algorithm

  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • n_steps (int) – number of rollout steps. Defaults to 5.

  • learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.0007.

  • entropy_coefficient (float) – scalar of entropy regularization term. Defaults to 0.01.

  • value_coefficient (float) – scalar of value loss. Defaults to 0.5.

  • decay (float) – decay parameter of Adam solver. Defaults to 0.99.

  • epsilon (float) – epsilon of Adam solver. Defaults to 0.00001.

  • start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 1.

  • actor_num (int) – number of parallel actors. Defaults to 8.

  • timelimit_as_terminal (bool) – Treat as done if the environment reaches the timelimit. Defaults to False.

  • max_grad_norm (float) – threshold value for clipping gradient. Defaults to 0.5.

  • seed (int) – base seed of random number generator used by the actors. Defaults to -1.
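To illustrate how gamma and n_steps interact, here is a minimal sketch of the discounted returns for one rollout. It assumes the usual actor-critic bootstrapping from a value estimate for the state after the rollout, and it ignores episode termination. The function name and arguments are illustrative and not part of nnabla_rl.

```python
from typing import List, Sequence


def n_step_returns(rewards: Sequence[float], bootstrap_value: float,
                   gamma: float = 0.99) -> List[float]:
    """Return the discounted return for each step of one rollout.

    rewards holds the n_steps rewards collected by one actor, and
    bootstrap_value is the critic's estimate for the state after the rollout.
    """
    returns = []
    running = bootstrap_value
    # Walk the rollout backwards: R_t = r_t + gamma * R_{t+1}
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# n_steps=5 rewards of 1.0 with no bootstrap value.
returns = n_step_returns([1.0] * 5, bootstrap_value=0.0, gamma=0.99)
```

The first entry is the full 5-step return, and the last entry contains only the final reward plus the discounted bootstrap value.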

class nnabla_rl.algorithms.a2c.A2C(env_or_env_info, v_function_builder:[nnabla_rl.models.v_function.VFunction] = <nnabla_rl.algorithms.a2c.DefaultVFunctionBuilder object>, v_solver_builder: = <nnabla_rl.algorithms.a2c.DefaultSolverBuilder object>, policy_builder:[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.a2c.DefaultPolicyBuilder object>, policy_solver_builder: = <nnabla_rl.algorithms.a2c.DefaultSolverBuilder object>, config=A2CConfig(gpu_id=-1, gamma=0.99, n_steps=5, learning_rate=0.0007, entropy_coefficient=0.01, value_coefficient=0.5, decay=0.99, epsilon=1e-05, start_timesteps=1, actor_num=8, timelimit_as_terminal=False, max_grad_norm=0.5, seed=-1))[source]

Bases: nnabla_rl.algorithm.Algorithm

Advantage Actor-Critic (A2C) algorithm implementation.

This class implements the Advantage Actor-Critic (A2C) algorithm. A2C is the synchronous version of A3C (Asynchronous Advantage Actor-Critic). A3C was proposed by V. Mnih, et al. in the paper “Asynchronous Methods for Deep Reinforcement Learning”.

This algorithm only supports online training.

property latest_iteration_state

Return the latest iteration state, which is composed of items of the training process state. You can use this state for debugging (e.g. to plot a loss curve). See [IterationStateHook](./hooks/) for more details.


Dictionary with items of training process state.

Return type

Dict[str, Any]


class nnabla_rl.algorithms.bcq.BCQConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 0.001, batch_size: int = 100, tau: float = 0.005, lmb: float = 0.75, phi: float = 0.05, num_q_ensembles: int = 2, num_action_samples: int = 10)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for BCQ algorithm

  • gamma (float) – discount factor of reward. Defaults to 0.99.

  • learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.001.

  • batch_size (int) – training batch size. Defaults to 100.

  • tau (float) – target network’s parameter update coefficient. Defaults to 0.005.

  • lmb (float) – weight \(\lambda\) used for balancing the ratio between \(\min{Q}\) and \(\max{Q}\) on target q value generation (i.e. \(\lambda\min{Q} + (1 - \lambda)\max{Q}\)). Defaults to 0.75.

  • phi (float) – action perturbator noise coefficient. Defaults to 0.05.

  • num_q_ensembles (int) – number of q function ensembles. Defaults to 2.

  • num_action_samples (int) – number of actions to sample for computing target q values. Defaults to 10.
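The lmb weighting above can be sketched in a few lines. This is a minimal illustration of the formula \(\lambda\min{Q} + (1 - \lambda)\max{Q}\) only. The function name is hypothetical and the code does not reproduce how nnabla_rl computes the full target.

```python
from typing import Sequence


def weighted_min_max_q(q_values: Sequence[float], lmb: float = 0.75) -> float:
    """Combine ensemble estimates as lmb * min(Q) + (1 - lmb) * max(Q)."""
    return lmb * min(q_values) + (1.0 - lmb) * max(q_values)


# Two ensemble members (num_q_ensembles=2) evaluated on one candidate action.
target_q = weighted_min_max_q([1.0, 3.0], lmb=0.75)

# lmb=1.0 reduces to the pessimistic minimum over the ensemble.
pessimistic_q = weighted_min_max_q([1.0, 3.0], lmb=1.0)
```

With lmb close to 1 the target leans on the smaller estimate, which counteracts overestimation.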

class nnabla_rl.algorithms.bcq.BCQ(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.bcq.BCQConfig = BCQConfig(gpu_id=-1, gamma=0.99, learning_rate=0.001, batch_size=100, tau=0.005, lmb=0.75, phi=0.05, num_q_ensembles=2, num_action_samples=10), q_function_builder:[nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.bcq.DefaultQFunctionBuilder object>, q_solver_builder: = <nnabla_rl.algorithms.bcq.DefaultSolverBuilder object>, vae_builder:[nnabla_rl.models.encoder.VariationalAutoEncoder] = <nnabla_rl.algorithms.bcq.DefaultVAEBuilder object>, vae_solver_builder: = <nnabla_rl.algorithms.bcq.DefaultSolverBuilder object>, perturbator_builder:[nnabla_rl.models.perturbator.Perturbator] = <nnabla_rl.algorithms.bcq.DefaultPerturbatorBuilder object>, perturbator_solver_builder: = <nnabla_rl.algorithms.bcq.DefaultSolverBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Batch-Constrained Q-learning (BCQ) algorithm

This class implements the Batch-Constrained Q-learning (BCQ) algorithm proposed by S. Fujimoto, et al. in the paper “Off-Policy Deep Reinforcement Learning without Exploration”.

This algorithm only supports offline training.

  • env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info

  • config (BCQConfig) – configuration of the BCQ algorithm

  • q_function_builder (ModelBuilder[QFunction]) – builder of q-function models

  • q_solver_builder (SolverBuilder) – builder for q-function solvers

  • vae_builder (ModelBuilder[VariationalAutoEncoder]) – builder of variational auto encoder models

  • vae_solver_builder (SolverBuilder) – builder for variational auto encoder solvers

  • perturbator_builder (PerturbatorBuilder) – builder of perturbator models

  • perturbator_solver_builder (SolverBuilder) – builder for perturbator solvers

property latest_iteration_state

Return the latest iteration state, which is composed of items of the training process state. You can use this state for debugging (e.g. to plot a loss curve). See [IterationStateHook](./hooks/) for more details.


Dictionary with items of training process state.

Return type

Dict[str, Any]


class nnabla_rl.algorithms.bear.BEARConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 0.001, batch_size: int = 100, tau: float = 0.005, lmb: float = 0.75, epsilon: float = 0.05, num_q_ensembles: int = 2, num_mmd_actions: int = 5, num_action_samples: int = 10, mmd_type: str = 'gaussian', mmd_sigma: float = 20.0, initial_lagrange_multiplier: Optional[float] = None, fix_lagrange_multiplier: bool = False, warmup_iterations: int = 20000, use_mean_for_eval: bool = False)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for BEAR algorithm.

  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.001.

  • batch_size (int) – training batch size. Defaults to 100.

  • tau (float) – target network’s parameter update coefficient. Defaults to 0.005.

  • lmb (float) – weight \(\lambda\) used for balancing the ratio between \(\min{Q}\) and \(\max{Q}\) on target q value generation (i.e. \(\lambda\min{Q} + (1 - \lambda)\max{Q}\)). Defaults to 0.75.

  • epsilon (float) – inequality constraint of dual gradient descent. Defaults to 0.05.

  • num_q_ensembles (int) – number of q function ensembles. Defaults to 2.

  • num_mmd_actions (int) – number of actions to sample for computing maximum mean discrepancy (MMD). Defaults to 5.

  • num_action_samples (int) – number of actions to sample for computing target q values. Defaults to 10.

  • mmd_type (str) – kernel type used for MMD computation. Either laplacian or gaussian is supported. Defaults to gaussian.

  • mmd_sigma (float) – parameter used for adjusting the MMD. Defaults to 20.0.

  • initial_lagrange_multiplier (float, optional) – Initial value of the Lagrange multiplier. If not specified, a random value sampled from a normal distribution will be used instead.

  • fix_lagrange_multiplier (bool) – Whether to fix the Lagrange multiplier. Defaults to False.

  • warmup_iterations (int) – Number of iterations before the policy starts being updated. Defaults to 20000.

  • use_mean_for_eval (bool) – Use mean value instead of best action among the samples for evaluation. Defaults to False.
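As a rough picture of what mmd_type and mmd_sigma control, the following sketch estimates a squared MMD between two small sets of actions with either kernel. The bandwidth parameterization of the kernels and the simple biased estimator are assumptions made for illustration and may differ from what nnabla_rl computes. All names are hypothetical.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def gaussian_kernel(x: Vector, y: Vector, sigma: float) -> float:
    # Assumed form: exp(-||x - y||^2 / (2 * sigma))
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma))


def laplacian_kernel(x: Vector, y: Vector, sigma: float) -> float:
    # Assumed form: exp(-||x - y||_1 / sigma)
    l1_distance = sum(abs(a - b) for a, b in zip(x, y))
    return math.exp(-l1_distance / sigma)


def mmd_squared(xs: Sequence[Vector], ys: Sequence[Vector],
                kernel: Callable[[Vector, Vector, float], float],
                sigma: float) -> float:
    """Biased estimate of MMD^2 = E[k(x,x')] - 2 E[k(x,y)] + E[k(y,y')]."""
    def mean_kernel(us: Sequence[Vector], vs: Sequence[Vector]) -> float:
        total = sum(kernel(u, v, sigma) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) - 2.0 * mean_kernel(xs, ys) + mean_kernel(ys, ys)


dataset_actions = [[0.0, 0.0], [1.0, 0.0]]
policy_actions = [[0.5, 0.5], [2.0, 1.0]]

same = mmd_squared(dataset_actions, dataset_actions, gaussian_kernel, sigma=20.0)
different = mmd_squared(dataset_actions, policy_actions, gaussian_kernel, sigma=20.0)
different_laplacian = mmd_squared(dataset_actions, policy_actions, laplacian_kernel, sigma=20.0)
```

The estimate is zero when both sets are identical and grows as the sampled policy actions move away from the dataset actions.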

class nnabla_rl.algorithms.bear.BEAR(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.bear.BEARConfig = BEARConfig(gpu_id=-1, gamma=0.99, learning_rate=0.001, batch_size=100, tau=0.005, lmb=0.75, epsilon=0.05, num_q_ensembles=2, num_mmd_actions=5, num_action_samples=10, mmd_type='gaussian', mmd_sigma=20.0, initial_lagrange_multiplier=None, fix_lagrange_multiplier=False, warmup_iterations=20000, use_mean_for_eval=False), q_function_builder:[nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.bear.DefaultQFunctionBuilder object>, q_solver_builder: = <nnabla_rl.algorithms.bear.DefaultSolverBuilder object>, pi_builder:[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.bear.DefaultPolicyBuilder object>, pi_solver_builder: = <nnabla_rl.algorithms.bear.DefaultSolverBuilder object>, vae_builder:[nnabla_rl.models.encoder.VariationalAutoEncoder] = <nnabla_rl.algorithms.bear.DefaultVAEBuilder object>, vae_solver_builder: = <nnabla_rl.algorithms.bear.DefaultSolverBuilder object>, lagrange_solver_builder: = <nnabla_rl.algorithms.bear.DefaultSolverBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Bootstrapping Error Accumulation Reduction (BEAR) algorithm.

This class implements the Bootstrapping Error Accumulation Reduction (BEAR) algorithm proposed by A. Kumar, et al. in the paper “Stabilizing Off-Policy Q-learning via Bootstrapping Error Reduction”.

This algorithm only supports offline training.

property latest_iteration_state

Return the latest iteration state, which is composed of items of the training process state. You can use this state for debugging (e.g. to plot a loss curve). See [IterationStateHook](./hooks/) for more details.


Dictionary with items of training process state.

Return type

Dict[str, Any]

Categorical DQN

class nnabla_rl.algorithms.categorical_dqn.CategoricalDQNConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 0.00025, batch_size: int = 32, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, learner_update_frequency: int = 4, target_update_frequency: int = 10000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.01, test_epsilon: float = 0.001, v_min: float = - 10.0, v_max: float = 10.0, num_atoms: int = 51)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for CategoricalDQN algorithm.

  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.00025.

  • batch_size (int) – training batch size. Defaults to 32.

  • start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 50000.

  • replay_buffer_size (int) – the capacity of replay buffer. Defaults to 1000000.

  • learner_update_frequency (int) – the interval of learner update. Defaults to 4.

  • target_update_frequency (int) – the interval of target q-function update. Defaults to 10000.

  • max_explore_steps (int) – the number of steps decaying the epsilon value. The epsilon will be decayed linearly \(\epsilon=\epsilon_{init} - step\times\frac{\epsilon_{init} - \epsilon_{final}}{max\_explore\_steps}\). Defaults to 1000000.

  • initial_epsilon (float) – the initial epsilon value for ε-greedy explorer. Defaults to 1.0.

  • final_epsilon (float) – the last epsilon value for ε-greedy explorer. Defaults to 0.01.

  • test_epsilon (float) – the epsilon value on testing. Defaults to 0.001.

  • v_min (float) – lower limit of the value used in value distribution function. Defaults to -10.0.

  • v_max (float) – upper limit of the value used in value distribution function. Defaults to 10.0.

  • num_atoms (int) – the number of bins used in value distribution function. Defaults to 51.
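The roles of v_min, v_max, and num_atoms can be sketched as follows. This minimal example assumes the standard categorical representation, in which the value distribution places probability mass on num_atoms evenly spaced points between v_min and v_max. The helper names are hypothetical and not part of nnabla_rl.

```python
from typing import List, Sequence


def build_atoms(v_min: float = -10.0, v_max: float = 10.0,
                num_atoms: int = 51) -> List[float]:
    """Return num_atoms evenly spaced support points from v_min to v_max."""
    delta_z = (v_max - v_min) / (num_atoms - 1)
    return [v_min + i * delta_z for i in range(num_atoms)]


def expected_value(probabilities: Sequence[float], atoms: Sequence[float]) -> float:
    """Q-value implied by a categorical distribution over the atoms."""
    return sum(p * z for p, z in zip(probabilities, atoms))


atoms = build_atoms()
# A uniform distribution over a symmetric support has an expected value near 0.
uniform = [1.0 / len(atoms)] * len(atoms)
q_value = expected_value(uniform, atoms)
```

With the default configuration the spacing between neighbouring atoms is (10 - (-10)) / 50 = 0.4.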

class nnabla_rl.algorithms.categorical_dqn.CategoricalDQN(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.categorical_dqn.CategoricalDQNConfig = CategoricalDQNConfig(gpu_id=-1, gamma=0.99, learning_rate=0.00025, batch_size=32, start_timesteps=50000, replay_buffer_size=1000000, learner_update_frequency=4, target_update_frequency=10000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.01, test_epsilon=0.001, v_min=-10.0, v_max=10.0, num_atoms=51), value_distribution_builder:[nnabla_rl.models.distributional_function.ValueDistributionFunction] = <nnabla_rl.algorithms.categorical_dqn.DefaultValueDistFunctionBuilder object>, value_distribution_solver_builder: = <nnabla_rl.algorithms.categorical_dqn.DefaultSolverBuilder object>, replay_buffer_builder: = <nnabla_rl.algorithms.categorical_dqn.DefaultReplayBufferBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Categorical DQN algorithm.

This class implements the Categorical DQN algorithm proposed by M. Bellemare, et al. in the paper “A Distributional Perspective on Reinforcement Learning”.

property latest_iteration_state

Return the latest iteration state, which is composed of items of the training process state. You can use this state for debugging (e.g. to plot a loss curve). See [IterationStateHook](./hooks/) for more details.


Dictionary with items of training process state.

Return type

Dict[str, Any]


class nnabla_rl.algorithms.ddpg.DDPGConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 0.001, batch_size: int = 100, tau: float = 0.005, start_timesteps: int = 10000, replay_buffer_size: int = 1000000, exploration_noise_sigma: float = 0.1)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for DDPG algorithm

  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.001.

  • batch_size (int) – training batch size. Defaults to 100.

  • tau (float) – target network’s parameter update coefficient. Defaults to 0.005.

  • start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 10000.

  • replay_buffer_size (int) – capacity of the replay buffer. Defaults to 1000000.

  • exploration_noise_sigma (float) – standard deviation of gaussian exploration noise. Defaults to 0.1.
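The tau coefficient is commonly used for a soft (Polyak) update of the target network. The sketch below assumes that interpretation and operates on plain lists of floats instead of network parameters. The function name is hypothetical and this is not nnabla_rl's implementation.

```python
from typing import List, Sequence


def soft_update(target_params: Sequence[float], source_params: Sequence[float],
                tau: float = 0.005) -> List[float]:
    """Move each target parameter a fraction tau towards the source parameter."""
    return [(1.0 - tau) * t + tau * s for t, s in zip(target_params, source_params)]


target = [0.0, 1.0]
source = [1.0, 1.0]
# One update moves the first parameter 0.5% of the way towards the source.
updated = soft_update(target, source, tau=0.005)
```

A small tau keeps the target network changing slowly, while tau=1.0 would copy the source parameters outright.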

class nnabla_rl.algorithms.ddpg.DDPG(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.ddpg.DDPGConfig = DDPGConfig(gpu_id=-1, gamma=0.99, learning_rate=0.001, batch_size=100, tau=0.005, start_timesteps=10000, replay_buffer_size=1000000, exploration_noise_sigma=0.1), critic_builder:[nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.ddpg.DefaultCriticBuilder object>, critic_solver_builder: = <nnabla_rl.algorithms.ddpg.DefaultSolverBuilder object>, actor_builder:[nnabla_rl.models.policy.DeterministicPolicy] = <nnabla_rl.algorithms.ddpg.DefaultActorBuilder object>, actor_solver_builder: = <nnabla_rl.algorithms.ddpg.DefaultSolverBuilder object>, replay_buffer_builder: = <nnabla_rl.algorithms.ddpg.DefaultReplayBufferBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Deep Deterministic Policy Gradient (DDPG) algorithm.

This class implements a modified version of the Deep Deterministic Policy Gradient (DDPG) algorithm proposed by T. P. Lillicrap, et al. in the paper “Continuous control with deep reinforcement learning”. We use Gaussian noise instead of an Ornstein-Uhlenbeck process to explore the environment. The effectiveness of using Gaussian noise for DDPG is reported in the paper “Addressing Function Approximation Error in Actor-Critic Methods”.


compute_eval_action(state)

Compute action for given state using current best policy. This is usually used for evaluation.


state (np.ndarray) – state to compute the action.


Action for given state using current trained policy.

Return type


property latest_iteration_state

Return the latest iteration state, which is composed of items of the training process state. You can use this state for debugging (e.g. to plot a loss curve). See [IterationStateHook](./hooks/) for more details.


Dictionary with items of training process state.

Return type

Dict[str, Any]


class nnabla_rl.algorithms.dqn.DQNConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 0.00025, batch_size: int = 32, learner_update_frequency: float = 4, target_update_frequency: float = 10000, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.1, test_epsilon: float = 0.05, grad_clip: Optional[Tuple[float, float]] = (- 1.0, 1.0))[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for DQN algorithm

  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.00025.

  • batch_size (int) – training batch size. Defaults to 32.

  • learner_update_frequency (int) – the interval of learner update. Defaults to 4.

  • target_update_frequency (int) – the interval of target q-function update. Defaults to 10000.

  • start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 50000.

  • replay_buffer_size (int) – the capacity of replay buffer. Defaults to 1000000.

  • max_explore_steps (int) – the number of steps decaying the epsilon value. The epsilon will be decayed linearly \(\epsilon=\epsilon_{init} - step\times\frac{\epsilon_{init} - \epsilon_{final}}{max\_explore\_steps}\). Defaults to 1000000.

  • initial_epsilon (float) – the initial epsilon value for ε-greedy explorer. Defaults to 1.0.

  • final_epsilon (float) – the last epsilon value for ε-greedy explorer. Defaults to 0.1.

  • test_epsilon (float) – the epsilon value on testing. Defaults to 0.05.

  • grad_clip (Optional[Tuple[float, float]]) – Clip the gradient of final layer. Defaults to (-1.0, 1.0).
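The linear schedule given for max_explore_steps can be written out directly. This is a minimal sketch of that formula using the DQNConfig defaults. Holding epsilon at final_epsilon once max_explore_steps is reached is an assumption, and the function name is hypothetical.

```python
def linear_epsilon(step: int, initial_epsilon: float = 1.0,
                   final_epsilon: float = 0.1,
                   max_explore_steps: int = 1_000_000) -> float:
    """Epsilon after `step` steps of linear decay."""
    # Assumption: clamp the step so epsilon stays at final_epsilon afterwards.
    step = min(max(step, 0), max_explore_steps)
    slope = (initial_epsilon - final_epsilon) / max_explore_steps
    return initial_epsilon - step * slope


epsilon_start = linear_epsilon(0)
epsilon_halfway = linear_epsilon(500_000)
epsilon_after = linear_epsilon(2_000_000)
```

Halfway through the schedule, epsilon sits midway between initial_epsilon and final_epsilon.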

class nnabla_rl.algorithms.dqn.DQN(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.dqn.DQNConfig = DQNConfig(gpu_id=-1, gamma=0.99, learning_rate=0.00025, batch_size=32, learner_update_frequency=4, target_update_frequency=10000, start_timesteps=50000, replay_buffer_size=1000000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.1, test_epsilon=0.05, grad_clip=(-1.0, 1.0)), q_func_builder:[nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.dqn.DefaultQFunctionBuilder object>, q_solver_builder: = <nnabla_rl.algorithms.dqn.DefaultSolverBuilder object>, replay_buffer_builder: = <nnabla_rl.algorithms.dqn.DefaultReplayBufferBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

DQN algorithm.

This class implements the Deep Q-Network (DQN) algorithm proposed by V. Mnih, et al. in the paper: “Human-level control through deep reinforcement learning” For details see:

Note that the default solver used in this implementation is RMSPropGraves, as in the original paper. However, in practical applications, we recommend using Adam as the optimizer for DQN. You can replace the solver by implementing a SolverBuilder and passing it when instantiating the DQN class.

  • env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info

  • config (DQNConfig) – the parameter for DQN training

  • q_func_builder (ModelBuilder) – builder of q function model

  • q_solver_builder (SolverBuilder) – builder of q function solver

  • replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer

property latest_iteration_state

Return the latest iteration state, which is composed of items of the training process state. You can use this state for debugging (e.g. to plot a loss curve). See [IterationStateHook](./hooks/) for more details.


Dictionary with items of training process state.

Return type

Dict[str, Any]


class nnabla_rl.algorithms.gail.GAILConfig(gpu_id: int = - 1, preprocess_state: bool = True, act_deterministic_in_eval: bool = True, discriminator_batch_size: int = 50000, discriminator_learning_rate: float = 0.01, discriminator_update_frequency: int = 1, adversary_entropy_coef: float = 0.001, policy_update_frequency: int = 1, gamma: float = 0.995, lmb: float = 0.97, pi_batch_size: int = 50000, num_steps_per_iteration: int = 50000, sigma_kl_divergence_constraint: float = 0.01, maximum_backtrack_numbers: int = 10, conjugate_gradient_damping: float = 0.1, conjugate_gradient_iterations: int = 10, vf_epochs: int = 5, vf_batch_size: int = 128, vf_learning_rate: float = 0.001)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for GAIL algorithm

  • act_deterministic_in_eval (bool) – Enable acting deterministically at evaluation. Defaults to True.

  • discriminator_batch_size (int) – Training batch size of the discriminator. Usually, discriminator_batch_size is the same as pi_batch_size. Defaults to 50000.

  • discriminator_learning_rate (float) – Learning rate which is set to the solvers of the discriminator function. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.01.

  • discriminator_update_frequency (int) – Frequency (measured in the number of parameter updates) of discriminator updates. Defaults to 1.

  • adversary_entropy_coef (float) – Coefficient of entropy loss in discriminator training. Defaults to 0.001.

  • policy_update_frequency (int) – Frequency (measured in the number of parameter updates) of policy updates. Defaults to 1.

  • gamma (float) – Discount factor of rewards. Defaults to 0.995.

  • lmb (float) – Scalar of lambda return’s computation in GAE. Defaults to 0.97. This configuration is related to the bias and variance of the estimated value. If it is close to 0, the estimated value is low-variance but biased. If it is close to 1, the estimated value is unbiased but high-variance.

  • num_steps_per_iteration (int) – Number of steps per training iteration for collecting on-policy experiences. Increasing this value gives more precise policy and value function updates, but increases the computation time of each iteration. Defaults to 50000.

  • pi_batch_size (int) – Training batch size of the policy. Usually, pi_batch_size is the same as num_steps_per_iteration. Defaults to 50000.

  • sigma_kl_divergence_constraint (float) – Constraint size of KL divergence between the previous policy and the updated policy. Defaults to 0.01.

  • maximum_backtrack_numbers (int) – Maximum number of backtracks in the line search. Defaults to 10.

  • conjugate_gradient_damping (float) – Damping size of the conjugate gradient method. Defaults to 0.1.

  • conjugate_gradient_iterations (int) – Number of iterations of the conjugate gradient method. Defaults to 10.

  • vf_epochs (int) – Number of epochs in each iteration. Defaults to 5.

  • vf_batch_size (int) – Training batch size of the value function. Defaults to 128.

  • vf_learning_rate (float) – Learning rate which is set to the solvers of the value function. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.001.

  • preprocess_state (bool) – Enable preprocessing the states in the collected experiences before feeding them as a training batch. Defaults to True.
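The bias and variance trade-off controlled by lmb can be seen in a minimal sketch of generalized advantage estimation (GAE). It assumes the standard recursion over one trajectory and ignores episode boundaries. The function and argument names are hypothetical and not part of nnabla_rl.

```python
from typing import List, Sequence


def generalized_advantage_estimates(rewards: Sequence[float], values: Sequence[float],
                                    last_value: float, gamma: float = 0.995,
                                    lmb: float = 0.97) -> List[float]:
    """Compute GAE advantages for one trajectory.

    values[t] is the value estimate for the state at step t and last_value is
    the estimate for the state that follows the final step.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        td_error = rewards[t] + gamma * next_value - values[t]
        # A_t = delta_t + gamma * lmb * A_{t+1}
        running = td_error + gamma * lmb * running
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards = [1.0, 1.0]
values = [0.5, 0.5]
# lmb=0 keeps only the one-step TD errors (low variance, more bias).
one_step = generalized_advantage_estimates(rewards, values, 0.0, gamma=1.0, lmb=0.0)
# lmb=1 sums all future TD errors (less bias, higher variance).
full_sum = generalized_advantage_estimates(rewards, values, 0.0, gamma=1.0, lmb=1.0)
```

With lmb=0 each advantage is a single TD error, while with lmb=1 the first advantage accumulates every later TD error in the trajectory.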

class nnabla_rl.algorithms.gail.GAIL(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], expert_buffer: nnabla_rl.replay_buffer.ReplayBuffer, config: nnabla_rl.algorithms.gail.GAILConfig = GAILConfig(gpu_id=-1, preprocess_state=True, act_deterministic_in_eval=True, discriminator_batch_size=50000, discriminator_learning_rate=0.01, discriminator_update_frequency=1, adversary_entropy_coef=0.001, policy_update_frequency=1, gamma=0.995, lmb=0.97, pi_batch_size=50000, num_steps_per_iteration=50000, sigma_kl_divergence_constraint=0.01, maximum_backtrack_numbers=10, conjugate_gradient_damping=0.1, conjugate_gradient_iterations=10, vf_epochs=5, vf_batch_size=128, vf_learning_rate=0.001), v_function_builder:[nnabla_rl.models.v_function.VFunction] = <nnabla_rl.algorithms.gail.DefaultVFunctionBuilder object>, v_solver_builder: = <nnabla_rl.algorithms.gail.DefaultVFunctionSolverBuilder object>, policy_builder:[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.gail.DefaultPolicyBuilder object>, reward_function_builder:[nnabla_rl.models.reward_function.RewardFunction] = <nnabla_rl.algorithms.gail.DefaultRewardFunctionBuilder object>, reward_solver_builder: = <nnabla_rl.algorithms.gail.DefaultRewardFunctionSolverBuilder object>, state_preprocessor_builder: Optional[] = <nnabla_rl.algorithms.gail.DefaultPreprocessorBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Generative Adversarial Imitation Learning implementation.

This class implements the Generative Adversarial Imitation Learning (GAIL) algorithm proposed by Jonathan Ho, et al. in the paper “Generative Adversarial Imitation Learning”.

This algorithm only supports online training.

property latest_iteration_state

Return the latest iteration state, which is composed of items of the training process state. You can use this state for debugging (e.g. to plot a loss curve). See [IterationStateHook](./hooks/) for more details.


Dictionary with items of training process state.

Return type

Dict[str, Any]


class nnabla_rl.algorithms.iqn.IQNConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 5e-05, batch_size: int = 32, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, learner_update_frequency: int = 4, target_update_frequency: int = 10000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.01, test_epsilon: float = 0.001, N: int = 64, N_prime: int = 64, K: int = 32, kappa: float = 1.0, embedding_dim: int = 64)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for IQN algorithm

  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.00005.

  • batch_size (int) – training batch size. Defaults to 32.

  • start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 50000.

  • replay_buffer_size (int) – the capacity of replay buffer. Defaults to 1000000.

  • learner_update_frequency (int) – the interval of learner update. Defaults to 4.

  • target_update_frequency (int) – the interval of target q-function update. Defaults to 10000.

  • max_explore_steps (int) – the number of steps decaying the epsilon value. The epsilon will be decayed linearly \(\epsilon=\epsilon_{init} - step\times\frac{\epsilon_{init} - \epsilon_{final}}{max\_explore\_steps}\). Defaults to 1000000.

  • initial_epsilon (float) – the initial epsilon value for ε-greedy explorer. Defaults to 1.0.

  • final_epsilon (float) – the last epsilon value for ε-greedy explorer. Defaults to 0.01.

  • test_epsilon (float) – the epsilon value on testing. Defaults to 0.001.

  • N (int) – Number of samples to compute the current state’s quantile values. Defaults to 64.

  • N_prime (int) – Number of samples to compute the target state’s quantile values. Defaults to 64.

  • K (int) – Number of samples to compute greedy next action. Defaults to 32.

  • kappa (float) – threshold value of quantile huber loss. Defaults to 1.0.

  • embedding_dim (int) – dimension of embedding for the sample point. Defaults to 64.
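
The linear epsilon schedule given for max_explore_steps can be sketched in plain Python as follows. This is only an illustration of the formula, not the library's explorer: the function name is hypothetical, and holding epsilon at final_epsilon once max_explore_steps is reached is an assumption, since the formula itself does not say what happens afterwards.

```python
def linear_decay_epsilon(step, initial_epsilon=1.0, final_epsilon=0.01,
                         max_explore_steps=1000000):
    """Linearly decay epsilon from initial_epsilon towards final_epsilon."""
    # Assumption: epsilon stays at final_epsilon after max_explore_steps.
    step = min(max(step, 0), max_explore_steps)
    slope = (initial_epsilon - final_epsilon) / max_explore_steps
    return initial_epsilon - step * slope


# Print the schedule at a few steps, using the default configuration values.
for step in (0, 500000, 1000000, 2000000):
    print(step, linear_decay_epsilon(step))
```

With the defaults above, epsilon reaches the midpoint between initial_epsilon and final_epsilon after half of max_explore_steps.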

class nnabla_rl.algorithms.iqn.IQN(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.iqn.IQNConfig = IQNConfig(gpu_id=-1, gamma=0.99, learning_rate=5e-05, batch_size=32, start_timesteps=50000, replay_buffer_size=1000000, learner_update_frequency=4, target_update_frequency=10000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.01, test_epsilon=0.001, N=64, N_prime=64, K=32, kappa=1.0, embedding_dim=64), quantile_function_builder:[nnabla_rl.models.distributional_function.StateActionQuantileFunction] = <nnabla_rl.algorithms.iqn.DefaultQuantileFunctionBuilder object>, quantile_solver_builder: = <nnabla_rl.algorithms.iqn.DefaultSolverBuilder object>, replay_buffer_builder: = <nnabla_rl.algorithms.iqn.DefaultReplayBufferBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Implicit Quantile Network algorithm.

This class implements the Implicit Quantile Network (IQN) algorithm proposed by W. Dabney, et al. in the paper: “Implicit Quantile Networks for Distributional Reinforcement Learning” For details see:

property latest_iteration_state

Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/ for getting more details.


Dictionary with items of training process state.

Return type

Dict[str, Any]

Munchausen DQN

class nnabla_rl.algorithms.munchausen_dqn.MunchausenDQNConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 5e-05, batch_size: int = 32, learner_update_frequency: float = 4, target_update_frequency: float = 10000, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.01, test_epsilon: float = 0.001, entropy_temperature: float = 0.03, munchausen_scaling_term: float = 0.9, clipping_value: float = - 1)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for Munchausen DQN algorithm

  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.00005.

  • batch_size (int) – training batch size. Defaults to 32.

  • learner_update_frequency (int) – the interval of learner update. Defaults to 4.

  • target_update_frequency (int) – the interval of target q-function update. Defaults to 10000.

  • start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 50000.

  • replay_buffer_size (int) – the capacity of replay buffer. Defaults to 1000000.

  • max_explore_steps (int) – the number of steps decaying the epsilon value. The epsilon will be decayed linearly \(\epsilon=\epsilon_{init} - step\times\frac{\epsilon_{init} - \epsilon_{final}}{max\_explore\_steps}\). Defaults to 1000000.

  • initial_epsilon (float) – the initial epsilon value for ε-greedy explorer. Defaults to 1.0.

  • final_epsilon (float) – the last epsilon value for ε-greedy explorer. Defaults to 0.01.

  • test_epsilon (float) – the epsilon value on testing. Defaults to 0.001.

  • entropy_temperature (float) – temperature parameter of softmax policy distribution. Defaults to 0.03.

  • munchausen_scaling_term (float) – scalar of scaled log policy. Defaults to 0.9.

  • clipping_value (float) – Lower value of the logarithm of policy distribution. Defaults to -1.
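
To show how entropy_temperature, munchausen_scaling_term and clipping_value interact, here is a minimal sketch of the reward augmentation described in the Munchausen Reinforcement Learning paper: the scaled log-policy of a softmax policy over Q-values is clipped to [clipping_value, 0] and added to the reward. The function name and the use of plain Python lists are assumptions made for illustration; this is not the library's implementation.

```python
import math


def munchausen_reward(reward, q_values, action, entropy_temperature=0.03,
                      munchausen_scaling_term=0.9, clipping_value=-1.0):
    """Return the reward plus the clipped, scaled log-policy term."""
    # Softmax policy over Q-values, computed as a stable log-softmax.
    logits = [q / entropy_temperature for q in q_values]
    max_logit = max(logits)
    log_sum = math.log(sum(math.exp(l - max_logit) for l in logits))
    log_pi = logits[action] - max_logit - log_sum
    # Scaled log-policy, clipped to the range [clipping_value, 0].
    scaled_log_pi = entropy_temperature * log_pi
    clipped = min(max(scaled_log_pi, clipping_value), 0.0)
    return reward + munchausen_scaling_term * clipped


# A clearly sub-optimal action hits the lower clipping bound.
print(munchausen_reward(0.0, [0.0, 10.0], action=0))
```

The clipping keeps the added term bounded even when the policy assigns a probability close to zero to the taken action.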

class nnabla_rl.algorithms.munchausen_dqn.MunchausenDQN(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.munchausen_dqn.MunchausenDQNConfig = MunchausenDQNConfig(gpu_id=-1, gamma=0.99, learning_rate=5e-05, batch_size=32, learner_update_frequency=4, target_update_frequency=10000, start_timesteps=50000, replay_buffer_size=1000000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.01, test_epsilon=0.001, entropy_temperature=0.03, munchausen_scaling_term=0.9, clipping_value=-1), q_func_builder:[nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.munchausen_dqn.DefaultQFunctionBuilder object>, q_solver_builder: = <nnabla_rl.algorithms.munchausen_dqn.DefaultQSolverBuilder object>, replay_buffer_builder: = <nnabla_rl.algorithms.munchausen_dqn.DefaultReplayBufferBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Munchausen-DQN algorithm.

This class implements the Munchausen-DQN (Munchausen Deep Q Network) algorithm proposed by N. Vieillard, et al. in the paper: “Munchausen Reinforcement Learning” For details see:

property latest_iteration_state

Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/ for getting more details.


Dictionary with items of training process state.

Return type

Dict[str, Any]

Munchausen IQN

class nnabla_rl.algorithms.munchausen_iqn.MunchausenIQNConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 5e-05, batch_size: int = 32, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, learner_update_frequency: int = 4, target_update_frequency: int = 10000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.01, test_epsilon: float = 0.001, N: int = 64, N_prime: int = 64, K: int = 32, kappa: float = 1.0, embedding_dim: int = 64, entropy_temperature: float = 0.03, munchausen_scaling_term: float = 0.9, clipping_value: float = - 1)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for Munchausen IQN algorithm

  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.00005.

  • batch_size (int) – training batch size. Defaults to 32.

  • start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 50000.

  • replay_buffer_size (int) – the capacity of replay buffer. Defaults to 1000000.

  • learner_update_frequency (int) – the interval of learner update. Defaults to 4.

  • target_update_frequency (int) – the interval of target q-function update. Defaults to 10000.

  • max_explore_steps (int) – the number of steps decaying the epsilon value. The epsilon will be decayed linearly \(\epsilon=\epsilon_{init} - step\times\frac{\epsilon_{init} - \epsilon_{final}}{max\_explore\_steps}\). Defaults to 1000000.

  • initial_epsilon (float) – the initial epsilon value for ε-greedy explorer. Defaults to 1.0.

  • final_epsilon (float) – the last epsilon value for ε-greedy explorer. Defaults to 0.01.

  • test_epsilon (float) – the epsilon value on testing. Defaults to 0.001.

  • N (int) – Number of samples to compute the current state’s quantile values. Defaults to 64.

  • N_prime (int) – Number of samples to compute the target state’s quantile values. Defaults to 64.

  • K (int) – Number of samples to compute greedy next action. Defaults to 32.

  • kappa (float) – threshold value of quantile huber loss. Defaults to 1.0.

  • embedding_dim (int) – dimension of embedding for the sample point. Defaults to 64.

  • entropy_temperature (float) – temperature parameter of softmax policy distribution. Defaults to 0.03.

  • munchausen_scaling_term (float) – scalar of scaled log policy. Defaults to 0.9.

  • clipping_value (float) – Lower value of the logarithm of policy distribution. Defaults to -1.
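
As a rough illustration of what embedding_dim controls, the IQN paper embeds each sample point tau through cosine features cos(pi * i * tau) for i = 0, ..., embedding_dim - 1, which are then passed through a learned linear layer. The sketch below computes only the cosine features. The helper names are hypothetical, and whether the library's default quantile function follows the paper exactly is an assumption.

```python
import math
import random


def cosine_features(tau, embedding_dim=64):
    """Return [cos(pi * i * tau) for i = 0, ..., embedding_dim - 1]."""
    return [math.cos(math.pi * i * tau) for i in range(embedding_dim)]


def sample_taus(num_samples, rng):
    """Draw sample points uniformly from [0, 1)."""
    return [rng.random() for _ in range(num_samples)]


# N = 64 sample points, each embedded into embedding_dim = 64 features.
taus = sample_taus(64, random.Random(0))
features = [cosine_features(tau) for tau in taus]
print(len(features), len(features[0]))
```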

class nnabla_rl.algorithms.munchausen_iqn.MunchausenIQN(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.munchausen_iqn.MunchausenIQNConfig = MunchausenIQNConfig(gpu_id=-1, gamma=0.99, learning_rate=5e-05, batch_size=32, start_timesteps=50000, replay_buffer_size=1000000, learner_update_frequency=4, target_update_frequency=10000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.01, test_epsilon=0.001, N=64, N_prime=64, K=32, kappa=1.0, embedding_dim=64, entropy_temperature=0.03, munchausen_scaling_term=0.9, clipping_value=-1), risk_measure_function: Callable[[nnabla._variable.Variable], nnabla._variable.Variable] = <function risk_neutral_measure>, quantile_function_builder:[nnabla_rl.models.distributional_function.StateActionQuantileFunction] = <nnabla_rl.algorithms.munchausen_iqn.DefaultQuantileFunctionBuilder object>, quantile_solver_builder: = <nnabla_rl.algorithms.munchausen_iqn.DefaultQuantileSolverBuilder object>, replay_buffer_builder: = <nnabla_rl.algorithms.munchausen_iqn.DefaultReplayBufferBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Munchausen-IQN algorithm implementation.

This class implements the Munchausen-IQN (Munchausen Implicit Quantile Network) algorithm proposed by N. Vieillard, et al. in the paper: “Munchausen Reinforcement Learning” For details see:

  • env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info

  • config (MunchausenIQNConfig) – configuration of MunchausenIQN algorithm

  • risk_measure_function (Callable[[nn.Variable], nn.Variable]) – risk measure function to apply to the quantiles.

  • quantile_function_builder (ModelBuilder[StateActionQuantileFunction]) – builder of state-action quantile function models

  • quantile_solver_builder (SolverBuilder) – builder for state action quantile function solvers

  • replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer

property latest_iteration_state

Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/ for getting more details.


Dictionary with items of training process state.

Return type

Dict[str, Any]


class nnabla_rl.algorithms.ppo.PPOConfig(gpu_id: int = - 1, epsilon: float = 0.1, gamma: float = 0.99, learning_rate: float = 0.00025, lmb: float = 0.95, entropy_coefficient: float = 0.01, value_coefficient: float = 1.0, actor_num: int = 8, epochs: int = 3, batch_size: int = 256, actor_timesteps: int = 128, total_timesteps: int = 10000, decrease_alpha: bool = True, timelimit_as_terminal: bool = False, seed: int = 1, preprocess_state: bool = True)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for PPO algorithm

  • epsilon (float) – PPO’s probability ratio clipping range. Defaults to 0.1.

  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.00025.

  • batch_size (int) – training batch size. Defaults to 256.

  • lmb (float) – scalar of lambda return’s computation in GAE. Defaults to 0.95.

  • entropy_coefficient (float) – scalar of entropy regularization term. Defaults to 0.01.

  • value_coefficient (float) – scalar of value loss. Defaults to 1.0.

  • actor_num (int) – Number of parallel actors. Defaults to 8.

  • epochs (int) – Number of epochs to perform in each training iteration. Defaults to 3.

  • actor_timesteps (int) – Number of timesteps to interact with the environment by the actors. Defaults to 128.

  • total_timesteps (int) – Total number of timesteps to interact with the environment. Defaults to 10000.

  • decrease_alpha (bool) – Flag to control whether to decrease the learning rate linearly during the training. Defaults to True.

  • timelimit_as_terminal (bool) – Treat as done if the environment reaches the timelimit. Defaults to False.

  • seed (int) – base seed of random number generator used by the actors. Defaults to 1.

  • preprocess_state (bool) – Enable preprocessing the states in the collected experiences before feeding as training batch. Defaults to True.
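
The role of epsilon can be illustrated with the clipped surrogate objective from the PPO paper, evaluated here for a single sample. This is a sketch in plain Python with a hypothetical function name; the library computes the objective on batches and combines it with the value and entropy terms weighted by value_coefficient and entropy_coefficient.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.1):
    """PPO clipped objective for one sample: min(r * A, clip(r) * A)."""
    # Restrict the probability ratio to [1 - epsilon, 1 + epsilon].
    clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    # Take the pessimistic (smaller) of the two surrogate values.
    return min(ratio * advantage, clipped_ratio * advantage)


# A large ratio with a positive advantage is capped at (1 + epsilon) * A.
print(clipped_surrogate(ratio=1.5, advantage=1.0))
```

Taking the minimum means the objective never rewards moving the ratio further outside the clipping range in the direction that would improve it.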

class nnabla_rl.algorithms.ppo.PPO(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.ppo.PPOConfig = PPOConfig(gpu_id=-1, epsilon=0.1, gamma=0.99, learning_rate=0.00025, lmb=0.95, entropy_coefficient=0.01, value_coefficient=1.0, actor_num=8, epochs=3, batch_size=256, actor_timesteps=128, total_timesteps=10000, decrease_alpha=True, timelimit_as_terminal=False, seed=1, preprocess_state=True), v_function_builder:[nnabla_rl.models.v_function.VFunction] = <nnabla_rl.algorithms.ppo.DefaultVFunctionBuilder object>, v_solver_builder: = <nnabla_rl.algorithms.ppo.DefaultSolverBuilder object>, policy_builder:[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.ppo.DefaultPolicyBuilder object>, policy_solver_builder: = <nnabla_rl.algorithms.ppo.DefaultSolverBuilder object>, state_preprocessor_builder: Optional[] = <nnabla_rl.algorithms.ppo.DefaultPreprocessorBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Proximal Policy Optimization (PPO) algorithm implementation.

This class implements the Proximal Policy Optimization (PPO) algorithm proposed by J. Schulman, et al. in the paper: “Proximal Policy Optimization Algorithms” For detail see:

This algorithm only supports online training.

property latest_iteration_state

Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/ for getting more details.


Dictionary with items of training process state.

Return type

Dict[str, Any]


class nnabla_rl.algorithms.qrdqn.QRDQNConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 5e-05, batch_size: int = 32, learner_update_frequency: int = 4, target_update_frequency: int = 10000, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.01, test_epsilon: float = 0.001, num_quantiles: int = 200, kappa: float = 1.0)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for QRDQN algorithm

  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.00005.

  • batch_size (int) – training batch size. Defaults to 32.

  • learner_update_frequency (int) – the interval of learner update. Defaults to 4.

  • target_update_frequency (int) – the interval of target q-function update. Defaults to 10000.

  • start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 50000.

  • replay_buffer_size (int) – the capacity of replay buffer. Defaults to 1000000.

  • max_explore_steps (int) – the number of steps decaying the epsilon value. The epsilon will be decayed linearly \(\epsilon=\epsilon_{init} - step\times\frac{\epsilon_{init} - \epsilon_{final}}{max\_explore\_steps}\). Defaults to 1000000.

  • initial_epsilon (float) – the initial epsilon value for ε-greedy explorer. Defaults to 1.0.

  • final_epsilon (float) – the last epsilon value for ε-greedy explorer. Defaults to 0.01.

  • test_epsilon (float) – the epsilon value on testing. Defaults to 0.001.

  • num_quantiles (int) – Number of quantile points. Defaults to 200.

  • kappa (float) – threshold value of quantile huber loss. Defaults to 1.0.
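
The kappa threshold can be made concrete with a sketch of the quantile Huber loss as defined in the QR-DQN paper, for a single TD error u and quantile level tau. The function names are hypothetical. Note that the IQN paper additionally divides by kappa, which gives the same value at the default kappa of 1.0; which variant the library uses is not stated here.

```python
def huber_loss(u, kappa=1.0):
    """Quadratic for |u| <= kappa, linear beyond it."""
    if abs(u) <= kappa:
        return 0.5 * u * u
    return kappa * (abs(u) - 0.5 * kappa)


def quantile_huber_loss(u, tau, kappa=1.0):
    """Asymmetrically weighted Huber loss for quantile level tau."""
    # Negative errors are weighted by (1 - tau), the others by tau.
    indicator = 1.0 if u < 0 else 0.0
    return abs(tau - indicator) * huber_loss(u, kappa)


print(quantile_huber_loss(0.5, tau=0.25))
print(quantile_huber_loss(-2.0, tau=0.25))
```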

class nnabla_rl.algorithms.qrdqn.QRDQN(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.qrdqn.QRDQNConfig = QRDQNConfig(gpu_id=-1, gamma=0.99, learning_rate=5e-05, batch_size=32, learner_update_frequency=4, target_update_frequency=10000, start_timesteps=50000, replay_buffer_size=1000000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.01, test_epsilon=0.001, num_quantiles=200, kappa=1.0), quantile_dist_function_builder:[nnabla_rl.models.distributional_function.QuantileDistributionFunction] = <nnabla_rl.algorithms.qrdqn.DefaultQuantileBuilder object>, quantile_solver_builder: = <nnabla_rl.algorithms.qrdqn.DefaultSolverBuilder object>, replay_buffer_builder: = <nnabla_rl.algorithms.qrdqn.DefaultReplayBufferBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Quantile Regression DQN algorithm.

This class implements the Quantile Regression DQN algorithm proposed by W. Dabney, et al. in the paper: “Distributional Reinforcement Learning with Quantile Regression” For details see:

property latest_iteration_state

Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/ for getting more details.


Dictionary with items of training process state.

Return type

Dict[str, Any]


class nnabla_rl.algorithms.reinforce.REINFORCEConfig(gpu_id: int = - 1, reward_scale: float = 0.01, num_rollouts_per_train_iteration: int = 10, learning_rate: float = 0.001, clip_grad_norm: float = 1.0, fixed_ln_var: float = - 2.3025850929940455)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

REINFORCE config

  • reward_scale (float) – Scale of reward. Defaults to 0.01.

  • num_rollouts_per_train_iteration (int) – Number of rollouts per training iteration for collecting on-policy experiences. Increasing this value gives more precise updates of the policy function parameters, but the computation time of each iteration will increase. Defaults to 10.

  • learning_rate (float) – Learning rate which is set to the solvers of the policy function. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.001.

  • clip_grad_norm (float) – Clip the norm of the gradient to this value. Defaults to 1.0.

  • fixed_ln_var (float) – Fixed log variance of the policy. This configuration is only valid when the environment is continuous. Defaults to ln(0.1) ≈ -2.3026.
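
The default of fixed_ln_var shown in the class signature, -2.3025850929940455, is the natural logarithm of 0.1. Reading the parameter name literally as a log variance, the short sketch below converts it back to a variance and a standard deviation. The variable names are illustrative only.

```python
import math

# Natural log of a variance of 0.1 (matches the default in the signature).
fixed_ln_var = math.log(0.1)
# Recover the variance and the standard deviation from the log variance.
variance = math.exp(fixed_ln_var)
std = math.sqrt(variance)
print(fixed_ln_var, variance, std)
```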

class nnabla_rl.algorithms.reinforce.REINFORCE(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.reinforce.REINFORCEConfig = REINFORCEConfig(gpu_id=-1, reward_scale=0.01, num_rollouts_per_train_iteration=10, learning_rate=0.001, clip_grad_norm=1.0, fixed_ln_var=-2.3025850929940455), policy_builder:[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.reinforce.DefaultPolicyBuilder object>, policy_solver_builder: = <nnabla_rl.algorithms.reinforce.DefaultSolverBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Episodic REINFORCE implementation.

This class implements the episodic REINFORCE algorithm proposed by Ronald J. Williams in the paper: “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning”. For details see:

This algorithm only supports online training.

property latest_iteration_state

Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/ for getting more details.


Dictionary with items of training process state.

Return type

Dict[str, Any]


class nnabla_rl.algorithms.sac.SACConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 0.00030000000000000003, batch_size: int = 256, tau: float = 0.005, environment_steps: int = 1, gradient_steps: int = 1, target_entropy: Optional[float] = None, initial_temperature: Optional[float] = None, fix_temperature: bool = False, start_timesteps: int = 10000, replay_buffer_size: int = 1000000)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for SAC algorithm

  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.0003.

  • batch_size (int) – training batch size. Defaults to 256.

  • tau (float) – target network’s parameter update coefficient. Defaults to 0.005.

  • environment_steps (int) – Number of steps to interact with the environment on each iteration. Defaults to 1.

  • gradient_steps (int) – Number of parameter updates to perform on each iteration. Defaults to 1.

  • target_entropy (float, optional) – Target entropy value. Defaults to None.

  • initial_temperature (float, optional) – Initial value of temperature parameter. Defaults to None.

  • fix_temperature (bool) – If true the temperature parameter will not be trained. Defaults to False.

  • start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 10000.

  • replay_buffer_size (int) – capacity of the replay buffer. Defaults to 1000000.
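
The tau coefficient is commonly used as a soft (Polyak) update of the target network parameters. The sketch below shows that rule on plain lists of floats. It assumes the usual convention in which tau weights the freshly trained parameters; the function name is hypothetical, and this is not the library's code.

```python
def soft_update(target_params, source_params, tau=0.005):
    """Move each target parameter a fraction tau towards its source."""
    return [tau * s + (1.0 - tau) * t
            for t, s in zip(target_params, source_params)]


target = [0.0, 1.0]
source = [1.0, 1.0]
# With a small tau the target parameters change only slightly per update.
print(soft_update(target, source))
```

Under this convention, tau = 1.0 copies the trained parameters into the target network, and smaller values make the target change more slowly.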

class nnabla_rl.algorithms.sac.SAC(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.sac.SACConfig = SACConfig(gpu_id=-1, gamma=0.99, learning_rate=0.00030000000000000003, batch_size=256, tau=0.005, environment_steps=1, gradient_steps=1, target_entropy=None, initial_temperature=None, fix_temperature=False, start_timesteps=10000, replay_buffer_size=1000000), q_function_builder:[nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.sac.DefaultQFunctionBuilder object>, q_solver_builder: = <nnabla_rl.algorithms.sac.DefaultSolverBuilder object>, policy_builder:[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.sac.DefaultPolicyBuilder object>, policy_solver_builder: = <nnabla_rl.algorithms.sac.DefaultSolverBuilder object>, temperature_solver_builder: = <nnabla_rl.algorithms.sac.DefaultSolverBuilder object>, replay_buffer_builder: = <nnabla_rl.algorithms.sac.DefaultReplayBufferBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Soft Actor-Critic (SAC) algorithm implementation.

This class implements the extended version of Soft Actor Critic (SAC) algorithm proposed by T. Haarnoja, et al. in the paper: “Soft Actor-Critic Algorithms and Applications” For detail see:

This algorithm differs slightly from the implementation of the Soft Actor-Critic algorithm also presented by T. Haarnoja, et al. in the following paper:

The temperature parameter is adjusted automatically instead of providing reward scalar as a hyper parameter.

property latest_iteration_state

Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/ for getting more details.


Dictionary with items of training process state.

Return type

Dict[str, Any]

SAC (ICML 2018 version)

class nnabla_rl.algorithms.icml2018_sac.ICML2018SACConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 0.00030000000000000003, batch_size: int = 256, tau: float = 0.005, environment_steps: int = 1, gradient_steps: int = 1, reward_scalar: float = 5.0, start_timesteps: int = 10000, replay_buffer_size: int = 1000000, target_update_interval: int = 1)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for ICML2018SAC algorithm.

  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.0003.

  • batch_size (int) – training batch size. Defaults to 256.

  • tau (float) – target network’s parameter update coefficient. Defaults to 0.005.

  • environment_steps (int) – Number of steps to interact with the environment on each iteration. Defaults to 1.

  • gradient_steps (int) – Number of parameter updates to perform on each iteration. Defaults to 1.

  • reward_scalar (float) – Reward scaling factor. Obtained reward will be multiplied by this value. Defaults to 5.0.

  • start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 10000.

  • replay_buffer_size (int) – capacity of the replay buffer. Defaults to 1000000.

  • target_update_interval (int) – the interval of target v function parameter’s update. Defaults to 1.

class nnabla_rl.algorithms.icml2018_sac.ICML2018SAC(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.icml2018_sac.ICML2018SACConfig = ICML2018SACConfig(gpu_id=-1, gamma=0.99, learning_rate=0.00030000000000000003, batch_size=256, tau=0.005, environment_steps=1, gradient_steps=1, reward_scalar=5.0, start_timesteps=10000, replay_buffer_size=1000000, target_update_interval=1), v_function_builder:[nnabla_rl.models.v_function.VFunction] = <nnabla_rl.algorithms.icml2018_sac.DefaultVFunctionBuilder object>, v_solver_builder: = <nnabla_rl.algorithms.icml2018_sac.DefaultSolverBuilder object>, q_function_builder:[nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.icml2018_sac.DefaultQFunctionBuilder object>, q_solver_builder: = <nnabla_rl.algorithms.icml2018_sac.DefaultSolverBuilder object>, policy_builder:[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.icml2018_sac.DefaultPolicyBuilder object>, policy_solver_builder: = <nnabla_rl.algorithms.icml2018_sac.DefaultSolverBuilder object>, replay_buffer_builder: = <nnabla_rl.algorithms.icml2018_sac.DefaultReplayBufferBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Soft Actor-Critic (SAC) algorithm.

This class implements the ICML2018 version of Soft Actor Critic (SAC) algorithm proposed by T. Haarnoja, et al. in the paper: “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor” For detail see:

This implementation slightly differs from the implementation of Soft Actor-Critic algorithm presented also by T. Haarnoja, et al. in the following paper:

You will need to scale the reward received from the environment properly to get the algorithm to work.

property latest_iteration_state

Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/ for getting more details.


Dictionary with items of training process state.

Return type

Dict[str, Any]


class nnabla_rl.algorithms.td3.TD3Config(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 0.001, batch_size: int = 100, tau: float = 0.005, start_timesteps: int = 10000, replay_buffer_size: int = 1000000, d: int = 2, exploration_noise_sigma: float = 0.1, train_action_noise_sigma: float = 0.2, train_action_noise_abs: float = 0.5)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for TD3 algorithm

  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.001.

  • batch_size (int) – training batch size. Defaults to 100.

  • tau (float) – target network’s parameter update coefficient. Defaults to 0.005.

  • start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 10000.

  • replay_buffer_size (int) – capacity of the replay buffer. Defaults to 1000000.

  • d (int) – Interval of the policy update. The policy will be updated every d q-function updates. Defaults to 2.

  • exploration_noise_sigma (float) – Standard deviation of the gaussian exploration noise. Defaults to 0.1.

  • train_action_noise_sigma (float) – Standard deviation of the gaussian action noise used in the training. Defaults to 0.2.

  • train_action_noise_abs (float) – Absolute limit value of action noise used in the training. Defaults to 0.5.
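
The following sketch illustrates how d, train_action_noise_sigma and train_action_noise_abs are used in the TD3 paper: Gaussian noise is clipped before being added to the target action, and the policy is updated only on every d-th q-function update. It handles a scalar action for brevity and omits clipping the result to the action bounds. The function names are hypothetical and do not come from the library.

```python
import random


def smoothed_target_action(action, rng, sigma=0.2, noise_abs=0.5):
    """Add Gaussian noise, clipped to [-noise_abs, noise_abs], to an action."""
    noise = rng.gauss(0.0, sigma)
    noise = min(max(noise, -noise_abs), noise_abs)
    return action + noise


def should_update_policy(q_update_count, d=2):
    """Return True on every d-th q-function update."""
    return q_update_count % d == 0


rng = random.Random(0)
print(smoothed_target_action(0.3, rng))
print([should_update_policy(i) for i in range(1, 5)])
```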

class nnabla_rl.algorithms.td3.TD3(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.td3.TD3Config = TD3Config(gpu_id=-1, gamma=0.99, learning_rate=0.001, batch_size=100, tau=0.005, start_timesteps=10000, replay_buffer_size=1000000, d=2, exploration_noise_sigma=0.1, train_action_noise_sigma=0.2, train_action_noise_abs=0.5), critic_builder:[nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.td3.DefaultCriticBuilder object>, critic_solver_builder: = <nnabla_rl.algorithms.td3.DefaultSolverBuilder object>, actor_builder:[nnabla_rl.models.policy.DeterministicPolicy] = <nnabla_rl.algorithms.td3.DefaultActorBuilder object>, actor_solver_builder: = <nnabla_rl.algorithms.td3.DefaultSolverBuilder object>, replay_buffer_builder: = <nnabla_rl.algorithms.td3.DefaultReplayBufferBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Twin Delayed Deep Deterministic policy gradient (TD3) algorithm.

This class implements the Twin Delayed Deep Deterministic policy gradient (TD3) algorithm proposed by S. Fujimoto, et al. in the paper: “Addressing Function Approximation Error in Actor-Critic Methods”. For details see:

property latest_iteration_state

Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/ for getting more details.


Dictionary with items of training process state.

Return type

Dict[str, Any]


class nnabla_rl.algorithms.trpo.TRPOConfig(gpu_id: int = - 1, gamma: float = 0.995, lmb: float = 0.97, num_steps_per_iteration: int = 5000, pi_batch_size: int = 5000, sigma_kl_divergence_constraint: float = 0.01, maximum_backtrack_numbers: int = 10, conjugate_gradient_damping: float = 0.1, conjugate_gradient_iterations: int = 20, vf_epochs: int = 5, vf_batch_size: int = 64, vf_learning_rate: float = 0.001, preprocess_state: bool = True, gpu_batch_size: Optional[int] = None)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

TRPO config

gamma (float) – Discount factor of rewards. Defaults to 0.995.

lmb (float) – Scalar used to compute the lambda return in GAE. This setting controls the bias and variance of the estimated value. If it is close to 0, the estimated value is low-variance but biased. If it is close to 1, the estimated value is unbiased but high-variance. Defaults to 0.97.

num_steps_per_iteration (int) – Number of steps collected in each training iteration as on-policy experiences. Increasing it gives more precise policy and value function updates, but each iteration takes longer to compute. Defaults to 5000.

pi_batch_size (int) – Training batch size of the policy. Usually, pi_batch_size is the same as num_steps_per_iteration. Defaults to 5000.

sigma_kl_divergence_constraint (float) – Constraint size of the KL divergence between the previous policy and the updated policy. Defaults to 0.01.

maximum_backtrack_numbers (int) – Maximum number of backtracks in the line search. Defaults to 10.

conjugate_gradient_damping (float) – Damping size of the conjugate gradient method. Defaults to 0.1.

conjugate_gradient_iterations (int) – Number of iterations of the conjugate gradient method. Defaults to 20.

vf_epochs (int) – Number of value function training epochs in each iteration. Defaults to 5.

vf_batch_size (int) – Training batch size of the value function. Defaults to 64.

vf_learning_rate (float) – Learning rate which is set to the solvers of the value function. You can customize/override the learning rate for each solver by implementing your own (SolverBuilder). Defaults to 0.001.

preprocess_state (bool) – Enable preprocessing of the states in the collected experiences before feeding them as a training batch. Defaults to True.

gpu_batch_size (int, optional) – Batch size actually used for a single forward computation on the gpu, which reduces memory usage. As long as gpu memory is sufficient, this configuration should not be specified. If not specified, gpu_batch_size is the same as pi_batch_size. Defaults to None.
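To make the roles of gamma and lmb concrete, here is a minimal sketch of the GAE recursion in plain Python. It is not the nnabla_rl implementation. The function name compute_gae and the toy reward and value lists are assumptions made for illustration only.

```python
from typing import List, Sequence


def compute_gae(rewards: Sequence[float], values: Sequence[float],
                gamma: float = 0.995, lmb: float = 0.97) -> List[float]:
    """Return GAE advantages for one episode (illustrative helper).

    `values` holds V(s_0), ..., V(s_T): one more entry than `rewards`.
    The last entry is the bootstrap value (0.0 for a terminal state).
    """
    assert len(values) == len(rewards) + 1
    advantages = [0.0] * len(rewards)
    gae = 0.0
    # Walk backwards so each step reuses the accumulated future term.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        gae = delta + gamma * lmb * gae
        advantages[t] = gae
    return advantages


# Toy episode (made-up numbers): three steps ending in a terminal state.
rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]

# lmb=0 keeps only the one-step TD residual (low variance, biased).
one_step = compute_gae(rewards, values, gamma=0.9, lmb=0.0)
# lmb=1 sums all residuals, i.e. discounted return minus V (unbiased, high variance).
monte_carlo = compute_gae(rewards, values, gamma=0.9, lmb=1.0)
```

Values of lmb between these two extremes interpolate between the one-step and the full-return estimates.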

class nnabla_rl.algorithms.trpo.TRPO(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.trpo.TRPOConfig = TRPOConfig(gpu_id=-1, gamma=0.995, lmb=0.97, num_steps_per_iteration=5000, pi_batch_size=5000, sigma_kl_divergence_constraint=0.01, maximum_backtrack_numbers=10, conjugate_gradient_damping=0.1, conjugate_gradient_iterations=20, vf_epochs=5, vf_batch_size=64, vf_learning_rate=0.001, preprocess_state=True, gpu_batch_size=None), v_function_builder:[nnabla_rl.models.v_function.VFunction] = <nnabla_rl.algorithms.trpo.DefaultVFunctionBuilder object>, v_solver_builder: = <nnabla_rl.algorithms.trpo.DefaultSolverBuilder object>, policy_builder:[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.trpo.DefaultPolicyBuilder object>, state_preprocessor_builder: Optional[] = <nnabla_rl.algorithms.trpo.DefaultPreprocessorBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Trust Region Policy Optimization method with Generalized Advantage Estimation (GAE) implementation.

This class implements the Trust Region Policy Optimization (TRPO) with Generalized Advantage Estimation (GAE) algorithm proposed by J. Schulman, et al. in the papers: “Trust Region Policy Optimization” and “High-Dimensional Continuous Control Using Generalized Advantage Estimation”. For detail see: and

This algorithm only supports online training.
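The trust-region step is governed by sigma_kl_divergence_constraint and maximum_backtrack_numbers. The following is a rough sketch of a backtracking line search under that kind of constraint, not the code nnabla_rl runs. The function name, the shrink factor, and the toy surrogate and KL functions are all hypothetical.

```python
from typing import Callable, List, Tuple


def backtracking_line_search(
    params: List[float],
    full_step: List[float],
    surrogate: Callable[[List[float]], float],
    kl_divergence: Callable[[List[float], List[float]], float],
    sigma_kl_divergence_constraint: float = 0.01,
    maximum_backtrack_numbers: int = 10,
    shrink: float = 0.5,  # hypothetical shrink factor, not a TRPOConfig field
) -> Tuple[List[float], bool]:
    """Return (new_params, accepted); keep `params` if no step is acceptable."""
    baseline = surrogate(params)
    for i in range(maximum_backtrack_numbers):
        scale = shrink ** i
        candidate = [p + scale * s for p, s in zip(params, full_step)]
        within_region = (
            kl_divergence(params, candidate) <= sigma_kl_divergence_constraint
        )
        improved = surrogate(candidate) > baseline
        if within_region and improved:
            return candidate, True
    return list(params), False


# Toy stand-ins (assumptions): a concave objective and a quadratic "KL" proxy.
def toy_surrogate(theta: List[float]) -> float:
    return -(theta[0] - 1.0) ** 2


def toy_kl(old: List[float], new: List[float]) -> float:
    return 0.5 * sum((n - o) ** 2 for o, n in zip(old, new))


new_params, accepted = backtracking_line_search(
    [0.0], [1.0], toy_surrogate, toy_kl)
```

In this toy problem the full step violates the constraint, so the step is halved until the proxy KL drops below 0.01. With a smaller maximum_backtrack_numbers the search can give up and leave the parameters unchanged.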

property latest_iteration_state

Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/ for getting more details.


Dictionary with items of training process state.

Return type

Dict[str, Any]

TRPO (ICML 2015 version)

class nnabla_rl.algorithms.icml2015_trpo.ICML2015TRPOConfig(gpu_id: int = - 1, gamma: float = 0.99, num_steps_per_iteration: int = 100000, batch_size: int = 100000, gpu_batch_size: Optional[int] = None, sigma_kl_divergence_constraint: float = 0.01, maximum_backtrack_numbers: int = 10, conjugate_gradient_damping: float = 0.001, conjugate_gradient_iterations: int = 10)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

ICML2015TRPO config

gamma (float) – Discount factor of rewards. Defaults to 0.99.

num_steps_per_iteration (int) – Number of steps collected in each training iteration as on-policy experiences. Increasing it gives more precise policy updates, but each iteration takes longer to compute. Defaults to 100000.

batch_size (int) – Training batch size of the policy. Usually, batch_size is the same as num_steps_per_iteration. Defaults to 100000.

gpu_batch_size (int, optional) – Batch size actually used for a single forward computation on the gpu, which reduces memory usage. As long as gpu memory is sufficient, this configuration should not be specified. If not specified, gpu_batch_size is the same as batch_size. Defaults to None.

sigma_kl_divergence_constraint (float) – Constraint size of the KL divergence between the previous policy and the updated policy. Defaults to 0.01.

maximum_backtrack_numbers (int) – Maximum number of backtracks in the line search. Defaults to 10.

conjugate_gradient_damping (float) – Damping size of the conjugate gradient method. Defaults to 0.001.

conjugate_gradient_iterations (int) – Number of iterations of the conjugate gradient method. Defaults to 10.
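As an illustration of what conjugate_gradient_damping and conjugate_gradient_iterations control, the sketch below solves a damped linear system with the conjugate gradient method in plain Python. It is a generic textbook version under stated assumptions, not nnabla_rl's internal code. The helper names, the tolerance argument, and the toy 2x2 matrix are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(
    fisher_vector_product: Callable[[Vector], Vector],
    b: Vector,
    conjugate_gradient_damping: float = 0.001,
    conjugate_gradient_iterations: int = 10,
    tolerance: float = 1e-10,  # hypothetical early-exit threshold
) -> Vector:
    """Approximately solve (F + damping * I) x = b without forming F."""
    def damped_product(v: Vector) -> Vector:
        # Damping adds a multiple of the identity to keep the system well conditioned.
        return [fv + conjugate_gradient_damping * vi
                for fv, vi in zip(fisher_vector_product(v), v)]

    x = [0.0] * len(b)
    r = list(b)  # residual b - A x, with x = 0
    p = list(b)  # current search direction
    rr = dot(r, r)
    for _ in range(conjugate_gradient_iterations):
        if rr < tolerance:
            break
        ap = damped_product(p)
        alpha = rr / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        new_rr = dot(r, r)
        p = [ri + (new_rr / rr) * pi for ri, pi in zip(r, p)]
        rr = new_rr
    return x


# Toy symmetric positive-definite matrix standing in for the Fisher matrix.
F = [[2.0, 0.5], [0.5, 1.0]]


def toy_fvp(v: Vector) -> Vector:
    return [dot(row, v) for row in F]


g = [1.0, -1.0]  # stand-in for the policy gradient
step_direction = conjugate_gradient(toy_fvp, g)
```

More iterations give a more accurate solution at a higher cost per update, while a larger damping value trades accuracy of the step direction for numerical stability.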

class nnabla_rl.algorithms.icml2015_trpo.ICML2015TRPO(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.icml2015_trpo.ICML2015TRPOConfig = ICML2015TRPOConfig(gpu_id=-1, gamma=0.99, num_steps_per_iteration=100000, batch_size=100000, gpu_batch_size=None, sigma_kl_divergence_constraint=0.01, maximum_backtrack_numbers=10, conjugate_gradient_damping=0.001, conjugate_gradient_iterations=10), policy_builder:[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.icml2015_trpo.DefaultPolicyBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Trust Region Policy Optimization method with Single Path algorithm.

This class implements the Trust Region Policy Optimization (TRPO) with Single Path algorithm proposed by J. Schulman, et al. in the paper: “Trust Region Policy Optimization”. For detail see:
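In the single path procedure, the value of each state-action pair along a sampled trajectory is estimated by the discounted sum of future rewards. A minimal sketch of that estimate follows, assuming a single finished episode. The function name and the reward list are made up for illustration and are not part of nnabla_rl.

```python
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float = 0.99) -> List[float]:
    """Return the discounted sum of future rewards for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Toy episode (made-up rewards); gamma=0.5 keeps the arithmetic easy to follow.
q_estimates = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
# q_estimates == [1.5, 1.0, 2.0]
```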

property latest_iteration_state

Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/ for getting more details.


Dictionary with items of training process state.

Return type

Dict[str, Any]