nnablaRL

nnablaRL is a deep reinforcement learning library built on top of Neural Network Libraries that is intended to be used for research, development and production.

Getting started

Installation

Installing nnabla_rl is easy:

pip install nnabla_rl

If you would like to install nnabla_rl for development:

cd <nnabla_rl root dir>
pip install -e .

API documentation

nnablaRL APIs

Algorithms

All algorithms are derived from nnabla_rl.algorithm.Algorithm.

Note

Algorithms run on the CPU by default, regardless of any nnabla context set before instantiation. To run an algorithm on a GPU, set gpu_id through the algorithm’s config. Note that the algorithm overrides the nnabla context when training starts.

Algorithm

class nnabla_rl.algorithm.AlgorithmConfig(gpu_id: int = - 1)[source]

List of configurations common to all algorithms

Parameters

gpu_id (int) – id of the gpu to use. If negative, the training will run on cpu. Defaults to -1.

class nnabla_rl.algorithm.Algorithm(env_info, config=AlgorithmConfig(gpu_id=- 1))[source]

Base Algorithm class

Parameters

Note

Default functions, solvers and configurations are set to the configurations of each algorithm’s original paper. Default functions may not work depending on the environment.

abstract compute_eval_action(state) numpy.ndarray[source]

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters

state (np.ndarray) – state to compute the action.

Returns

Action for given state using current trained policy.

Return type

np.ndarray

abstract classmethod is_supported_env(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo])[source]

Check whether the algorithm supports the environment or not.

Parameters

env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info

Returns

True if the algorithm supports the environment. Otherwise False.

Return type

bool

property iteration_num: int

Current iteration number.

Returns

Current iteration number of running training.

Return type

int

property latest_iteration_state: Dict[str, Any]

Return the latest iteration state, which is composed of items describing the training process state. You can use this state for debugging (e.g. to plot a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns

Dictionary with items of training process state.

Return type

Dict[str, Any]

set_hooks(hooks: Sequence[nnabla_rl.hook.Hook])[source]

Set hooks for running additional operations during training. Previously set hooks will be removed and replaced with the new hooks.

Parameters

hooks (list of nnabla_rl.hook.Hook) – Hooks to invoke during training

train(env_or_buffer: Union[gym.core.Env, nnabla_rl.replay_buffer.ReplayBuffer], total_iterations: int)[source]

Train the policy with the reinforcement learning algorithm.

Parameters
  • env_or_buffer (Union[gym.Env, ReplayBuffer]) – Target environment to train the policy online, or replay buffer to train the policy offline.

  • total_iterations (int) – Total number of iterations to train the policy.

Raises

UnsupportedTrainingException – Raised if this algorithm does not support the training method for the given parameter.
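The behavior described for train can be pictured as a dispatch on the argument type. The following is a minimal sketch under stated assumptions, not the library's implementation: Env, ReplayBuffer, SketchAlgorithm and OfflineOnly are hypothetical stand-ins defined here so the example runs without nnabla_rl or gym installed.

```python
class UnsupportedTrainingException(Exception):
    """Stand-in for the exception named in the documentation."""


class Env:
    """Hypothetical stand-in for gym.Env."""


class ReplayBuffer:
    """Hypothetical stand-in for the library's ReplayBuffer."""


class SketchAlgorithm:
    """Toy base class (assumption): routes train() by argument type."""

    def train(self, env_or_buffer, total_iterations):
        if isinstance(env_or_buffer, Env):
            return self.train_online(env_or_buffer, total_iterations)
        if isinstance(env_or_buffer, ReplayBuffer):
            return self.train_offline(env_or_buffer, total_iterations)
        raise UnsupportedTrainingException(
            f"unsupported argument: {type(env_or_buffer).__name__}")

    def train_online(self, train_env, total_iterations):
        # Subclasses that support online training would override this.
        raise UnsupportedTrainingException("online training is not supported")

    def train_offline(self, replay_buffer, total_iterations):
        # Subclasses that support offline training would override this.
        raise UnsupportedTrainingException("offline training is not supported")


class OfflineOnly(SketchAlgorithm):
    """Toy algorithm that, like BCQ or BEAR, only trains from a buffer."""

    def train_offline(self, replay_buffer, total_iterations):
        return f"offline for {total_iterations} iterations"
```

With this sketch, OfflineOnly().train(ReplayBuffer(), 10) runs the offline path, while passing an Env raises UnsupportedTrainingException.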

train_offline(replay_buffer: gym.core.Env, total_iterations: int)[source]

Train the policy using only the replay buffer.

Parameters
  • replay_buffer (ReplayBuffer) – Replay buffer to sample experiences to train the policy.

  • total_iterations (int) – Total number of iterations to train the policy.

Raises

UnsupportedTrainingException – Raised if the algorithm does not support offline training.

train_online(train_env: gym.core.Env, total_iterations: int)[source]

Train the policy by interacting with given environment.

Parameters
  • train_env (gym.Env) – Target environment to train the policy.

  • total_iterations (int) – Total number of iterations to train the policy.

Raises

UnsupportedTrainingException – Raised if the algorithm does not support online training.

A2C

class nnabla_rl.algorithms.a2c.A2CConfig(gpu_id: int = - 1, gamma: float = 0.99, n_steps: int = 5, learning_rate: float = 0.0007, entropy_coefficient: float = 0.01, value_coefficient: float = 0.5, decay: float = 0.99, epsilon: float = 1e-05, start_timesteps: int = 1, actor_num: int = 8, timelimit_as_terminal: bool = False, max_grad_norm: Optional[float] = 0.5, seed: int = - 1, learning_rate_decay_iterations: int = 50000000)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for A2C algorithm

Parameters
  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • n_steps (int) – number of rollout steps. Defaults to 5.

  • learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.0007.

  • entropy_coefficient (float) – scalar of entropy regularization term. Defaults to 0.01.

  • value_coefficient (float) – scalar of value loss. Defaults to 0.5.

  • decay (float) – decay parameter of Adam solver. Defaults to 0.99.

  • epsilon (float) – epsilon of Adam solver. Defaults to 0.00001.

  • start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 1.

  • actor_num (int) – number of parallel actors. Defaults to 8.

  • timelimit_as_terminal (bool) – Treat as done if the environment reaches the timelimit. Defaults to False.

  • max_grad_norm (float) – threshold value for clipping gradient. Defaults to 0.5.

  • seed (int) – base seed of random number generator used by the actors. Defaults to -1.

  • learning_rate_decay_iterations (int) – learning rate will be decreased linearly to 0 until this iteration number. If 0 or negative, learning rate will be kept fixed. Defaults to 50000000.
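The learning_rate_decay_iterations rule can be illustrated with a small schedule function. This is a sketch of the behavior as described in the parameter list, not the library's code; the function name decayed_learning_rate is hypothetical.

```python
def decayed_learning_rate(initial_lr, iteration, decay_iterations):
    """Linearly decay initial_lr to 0 at decay_iterations.

    If decay_iterations is 0 or negative, the rate is kept fixed.
    """
    if decay_iterations <= 0:
        return initial_lr
    # Fraction of the schedule still remaining, clamped at 0.
    remaining = max(0.0, 1.0 - iteration / decay_iterations)
    return initial_lr * remaining
```

With the defaults above (0.0007 and 50000000), the rate is halved at iteration 25000000 and reaches 0 at iteration 50000000.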

class nnabla_rl.algorithms.a2c.A2C(env_or_env_info, v_function_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.v_function.VFunction] = <nnabla_rl.algorithms.a2c.DefaultVFunctionBuilder object>, v_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.a2c.DefaultSolverBuilder object>, policy_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.a2c.DefaultPolicyBuilder object>, policy_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.a2c.DefaultSolverBuilder object>, config=A2CConfig(gpu_id=-1, gamma=0.99, n_steps=5, learning_rate=0.0007, entropy_coefficient=0.01, value_coefficient=0.5, decay=0.99, epsilon=1e-05, start_timesteps=1, actor_num=8, timelimit_as_terminal=False, max_grad_norm=0.5, seed=-1, learning_rate_decay_iterations=50000000))[source]

Bases: nnabla_rl.algorithm.Algorithm

Advantage Actor-Critic (A2C) algorithm implementation.

This class implements the Advantage Actor-Critic (A2C) algorithm. A2C is the synchronous version of A3C, Asynchronous Advantage Actor-Critic. A3C was proposed by V. Mnih, et al. in the paper: “Asynchronous Methods for Deep Reinforcement Learning”. For details see: https://arxiv.org/abs/1602.01783

This algorithm only supports online training.

Parameters
compute_eval_action(**kwargs)

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters

state (np.ndarray) – state to compute the action.

Returns

Action for given state using current trained policy.

Return type

np.ndarray

property latest_iteration_state

Return the latest iteration state, which is composed of items describing the training process state. You can use this state for debugging (e.g. to plot a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns

Dictionary with items of training process state.

Return type

Dict[str, Any]

BCQ

class nnabla_rl.algorithms.bcq.BCQConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 0.001, batch_size: int = 100, tau: float = 0.005, lmb: float = 0.75, phi: float = 0.05, num_q_ensembles: int = 2, num_action_samples: int = 10)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for BCQ algorithm

Parameters
  • gamma (float) – discount factor of reward. Defaults to 0.99.

  • learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.001.

  • batch_size (int) – training batch size. Defaults to 100.

  • tau (float) – target network’s parameter update coefficient. Defaults to 0.005.

  • lmb (float) – weight \(\lambda\) used for balancing the ratio between \(\min{Q}\) and \(\max{Q}\) on target q value generation (i.e. \(\lambda\min{Q} + (1 - \lambda)\max{Q}\)). Defaults to 0.75.

  • phi (float) – action perturbator noise coefficient. Defaults to 0.05.

  • num_q_ensembles (int) – number of q function ensembles. Defaults to 2.

  • num_action_samples (int) – number of actions to sample for computing target q values. Defaults to 10.
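The lmb weighting can be made concrete with a few lines of plain Python. The first function is the formula from the parameter description. The second function, which takes the maximum over sampled actions, follows the target described in the BCQ paper and is an assumption about how num_action_samples is used; both function names are hypothetical.

```python
def mixed_q(ensemble_q_values, lmb=0.75):
    """Return lmb * min(Q) + (1 - lmb) * max(Q) over the q ensemble."""
    return lmb * min(ensemble_q_values) + (1.0 - lmb) * max(ensemble_q_values)


def target_value(q_per_sampled_action, lmb=0.75):
    """Pick the best mixed estimate among the sampled actions.

    q_per_sampled_action holds one list of ensemble Q-values per
    sampled action (assumed layout, for illustration only).
    """
    return max(mixed_q(qs, lmb) for qs in q_per_sampled_action)
```

For example, with two ensemble estimates 1.0 and 3.0 and lmb=0.75, the mixed value is 0.75 * 1.0 + 0.25 * 3.0 = 1.5, which leans toward the pessimistic estimate.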

class nnabla_rl.algorithms.bcq.BCQ(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.bcq.BCQConfig = BCQConfig(gpu_id=-1, gamma=0.99, learning_rate=0.001, batch_size=100, tau=0.005, lmb=0.75, phi=0.05, num_q_ensembles=2, num_action_samples=10), q_function_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.bcq.DefaultQFunctionBuilder object>, q_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.bcq.DefaultSolverBuilder object>, vae_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.encoder.VariationalAutoEncoder] = <nnabla_rl.algorithms.bcq.DefaultVAEBuilder object>, vae_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.bcq.DefaultSolverBuilder object>, perturbator_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.perturbator.Perturbator] = <nnabla_rl.algorithms.bcq.DefaultPerturbatorBuilder object>, perturbator_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.bcq.DefaultSolverBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Batch-Constrained Q-learning (BCQ) algorithm

This class implements the Batch-Constrained Q-learning (BCQ) algorithm proposed by S. Fujimoto, et al. in the paper: “Off-Policy Deep Reinforcement Learning without Exploration”. For details see: https://arxiv.org/abs/1812.02900

This algorithm only supports offline training.

Parameters
  • env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info

  • config (BCQConfig) – configuration of the BCQ algorithm

  • q_function_builder (ModelBuilder[QFunction]) – builder of q-function models

  • q_solver_builder (SolverBuilder) – builder for q-function solvers

  • vae_builder (ModelBuilder[VariationalAutoEncoder]) – builder of variational auto encoder models

  • vae_solver_builder (SolverBuilder) – builder for variational auto encoder solvers

  • perturbator_builder (ModelBuilder[Perturbator]) – builder of perturbator models

  • perturbator_solver_builder (SolverBuilder) – builder for perturbator solvers

compute_eval_action(**kwargs)

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters

state (np.ndarray) – state to compute the action.

Returns

Action for given state using current trained policy.

Return type

np.ndarray

property latest_iteration_state

Return the latest iteration state, which is composed of items describing the training process state. You can use this state for debugging (e.g. to plot a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns

Dictionary with items of training process state.

Return type

Dict[str, Any]

BEAR

class nnabla_rl.algorithms.bear.BEARConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 0.001, batch_size: int = 100, tau: float = 0.005, lmb: float = 0.75, epsilon: float = 0.05, num_q_ensembles: int = 2, num_mmd_actions: int = 5, num_action_samples: int = 10, mmd_type: str = 'gaussian', mmd_sigma: float = 20.0, initial_lagrange_multiplier: Optional[float] = None, fix_lagrange_multiplier: bool = False, warmup_iterations: int = 20000, use_mean_for_eval: bool = False)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for BEAR algorithm.

Parameters
  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.001.

  • batch_size (int) – training batch size. Defaults to 100.

  • tau (float) – target network’s parameter update coefficient. Defaults to 0.005.

  • lmb (float) – weight \(\lambda\) used for balancing the ratio between \(\min{Q}\) and \(\max{Q}\) on target q value generation (i.e. \(\lambda\min{Q} + (1 - \lambda)\max{Q}\)). Defaults to 0.75.

  • epsilon (float) – inequality constraint of dual gradient descent. Defaults to 0.05.

  • num_q_ensembles (int) – number of q ensembles. Defaults to 2.

  • num_mmd_actions (int) – number of actions to sample for computing maximum mean discrepancy (MMD). Defaults to 5.

  • num_action_samples (int) – number of actions to sample for computing target q values. Defaults to 10.

  • mmd_type (str) – kernel type used for MMD computation. Either laplacian or gaussian is supported. Defaults to gaussian.

  • mmd_sigma (float) – parameter used for adjusting the MMD. Defaults to 20.0.

  • initial_lagrange_multiplier (float, optional) – Initial value of the Lagrange multiplier. If not specified, a random value sampled from a normal distribution will be used instead. Defaults to None.

  • fix_lagrange_multiplier (bool) – Whether to fix the Lagrange multiplier or not. Defaults to False.

  • warmup_iterations (int) – Number of iterations before the policy starts being updated. Defaults to 20000.

  • use_mean_for_eval (bool) – Use mean value instead of best action among the samples for evaluation. Defaults to False.
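To give a feel for mmd_type and mmd_sigma, here is a sketch of a sample-based squared MMD between two sets of actions. It uses the standard biased estimator. The exact kernel scaling used by the library is not stated in this document, so the forms exp(-||x - y||^2 / (2 * sigma)) and exp(-||x - y||_1 / sigma) below are assumptions, as are the function names.

```python
import math


def kernel(x, y, kind="gaussian", sigma=20.0):
    """Kernel between two action vectors (scaling is an assumption)."""
    if kind == "gaussian":
        squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
        return math.exp(-squared_distance / (2.0 * sigma))
    if kind == "laplacian":
        l1_distance = sum(abs(a - b) for a, b in zip(x, y))
        return math.exp(-l1_distance / sigma)
    raise ValueError(f"unknown kernel type: {kind}")


def mmd_squared(xs, ys, kind="gaussian", sigma=20.0):
    """Biased estimate of squared MMD between sample sets xs and ys."""

    def mean_kernel(us, vs):
        total = sum(kernel(u, v, kind, sigma) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) - 2.0 * mean_kernel(xs, ys) + mean_kernel(ys, ys)
```

The estimate is zero when both sample sets are identical and grows as the two sets of actions move apart, which is the quantity the epsilon constraint bounds.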

class nnabla_rl.algorithms.bear.BEAR(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.bear.BEARConfig = BEARConfig(gpu_id=-1, gamma=0.99, learning_rate=0.001, batch_size=100, tau=0.005, lmb=0.75, epsilon=0.05, num_q_ensembles=2, num_mmd_actions=5, num_action_samples=10, mmd_type='gaussian', mmd_sigma=20.0, initial_lagrange_multiplier=None, fix_lagrange_multiplier=False, warmup_iterations=20000, use_mean_for_eval=False), q_function_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.bear.DefaultQFunctionBuilder object>, q_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.bear.DefaultSolverBuilder object>, pi_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.bear.DefaultPolicyBuilder object>, pi_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.bear.DefaultSolverBuilder object>, vae_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.encoder.VariationalAutoEncoder] = <nnabla_rl.algorithms.bear.DefaultVAEBuilder object>, vae_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.bear.DefaultSolverBuilder object>, lagrange_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.bear.DefaultSolverBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Bootstrapping Error Accumulation Reduction (BEAR) algorithm.

This class implements the Bootstrapping Error Accumulation Reduction (BEAR) algorithm proposed by A. Kumar, et al. in the paper: “Stabilizing Off-Policy Q-learning via Bootstrapping Error Reduction”. For details see: https://arxiv.org/abs/1906.00949

This algorithm only supports offline training.

Parameters
compute_eval_action(**kwargs)

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters

state (np.ndarray) – state to compute the action.

Returns

Action for given state using current trained policy.

Return type

np.ndarray

property latest_iteration_state

Return the latest iteration state, which is composed of items describing the training process state. You can use this state for debugging (e.g. to plot a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns

Dictionary with items of training process state.

Return type

Dict[str, Any]

Categorical DDQN

class nnabla_rl.algorithms.categorical_ddqn.CategoricalDDQNConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 0.00025, batch_size: int = 32, num_steps: int = 1, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, learner_update_frequency: int = 4, target_update_frequency: int = 10000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.01, test_epsilon: float = 0.001, v_min: float = - 10.0, v_max: float = 10.0, num_atoms: int = 51, loss_reduction_method: str = 'mean')[source]

Bases: nnabla_rl.algorithms.categorical_dqn.CategoricalDQNConfig

class nnabla_rl.algorithms.categorical_ddqn.CategoricalDDQN(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.categorical_dqn.CategoricalDQNConfig = CategoricalDDQNConfig(gpu_id=-1, gamma=0.99, learning_rate=0.00025, batch_size=32, num_steps=1, start_timesteps=50000, replay_buffer_size=1000000, learner_update_frequency=4, target_update_frequency=10000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.01, test_epsilon=0.001, v_min=-10.0, v_max=10.0, num_atoms=51, loss_reduction_method='mean'), value_distribution_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.distributional_function.ValueDistributionFunction] = <nnabla_rl.algorithms.categorical_dqn.DefaultValueDistFunctionBuilder object>, value_distribution_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.categorical_dqn.DefaultSolverBuilder object>, replay_buffer_builder: nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.categorical_dqn.DefaultReplayBufferBuilder object>, explorer_builder: nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.categorical_dqn.DefaultExplorerBuilder object>)[source]

Bases: nnabla_rl.algorithms.categorical_dqn.CategoricalDQN

Categorical Double DQN algorithm.

This class implements the Categorical Double DQN algorithm introduced by M. Hessel, et al. in the paper: “Rainbow: Combining Improvements in Deep Reinforcement Learning”. For details see: https://arxiv.org/abs/1710.02298. The difference between Categorical DQN and this algorithm is the update target of the q-value. This algorithm uses the following double DQN style q-value target for the categorical q-value update: \(r + \gamma Q_{\text{target}}(s_{t+1}, \arg\max_{a}{Q(s_{t+1}, a)})\).

Parameters

Categorical DQN

class nnabla_rl.algorithms.categorical_dqn.CategoricalDQNConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 0.00025, batch_size: int = 32, num_steps: int = 1, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, learner_update_frequency: int = 4, target_update_frequency: int = 10000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.01, test_epsilon: float = 0.001, v_min: float = - 10.0, v_max: float = 10.0, num_atoms: int = 51, loss_reduction_method: str = 'mean')[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for CategoricalDQN algorithm.

Parameters
  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.00025.

  • batch_size (int) – training batch size. Defaults to 32.

  • num_steps (int) – number of steps for N-step Q targets. Defaults to 1.

  • start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 50000.

  • replay_buffer_size (int) – the capacity of replay buffer. Defaults to 1000000.

  • learner_update_frequency (int) – the interval of learner update. Defaults to 4.

  • target_update_frequency (int) – the interval of target q-function update. Defaults to 10000.

  • max_explore_steps (int) – the number of steps decaying the epsilon value. The epsilon will be decayed linearly \(\epsilon=\epsilon_{init} - step\times\frac{\epsilon_{init} - \epsilon_{final}}{max\_explore\_steps}\). Defaults to 1000000.

  • initial_epsilon (float) – the initial epsilon value for ε-greedy explorer. Defaults to 1.0.

  • final_epsilon (float) – the last epsilon value for ε-greedy explorer. Defaults to 0.01.

  • test_epsilon (float) – the epsilon value on testing. Defaults to 0.001.

  • v_min (float) – lower limit of the value used in value distribution function. Defaults to -10.0.

  • v_max (float) – upper limit of the value used in value distribution function. Defaults to 10.0.

  • num_atoms (int) – the number of bins used in value distribution function. Defaults to 51.

  • loss_reduction_method (str) – KL loss reduction method. Either “sum” or “mean” is supported. Defaults to mean.
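The roles of v_min, v_max and num_atoms can be sketched as follows. In the Categorical DQN paper the value distribution is supported on evenly spaced atoms between the two limits, and a q-value is the expectation over that support. The function names are hypothetical, and this is an illustration rather than the library's code.

```python
def value_support(v_min=-10.0, v_max=10.0, num_atoms=51):
    """Evenly spaced atoms z_i = v_min + i * delta_z."""
    delta_z = (v_max - v_min) / (num_atoms - 1)
    return [v_min + i * delta_z for i in range(num_atoms)]


def expected_q(probabilities, support):
    """Q-value as the expectation of the categorical distribution."""
    return sum(p * z for p, z in zip(probabilities, support))
```

With the defaults, the atoms are spaced 0.4 apart from -10.0 to 10.0, and a uniform distribution over them has an expected value of approximately 0.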

class nnabla_rl.algorithms.categorical_dqn.CategoricalDQN(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.categorical_dqn.CategoricalDQNConfig = CategoricalDQNConfig(gpu_id=-1, gamma=0.99, learning_rate=0.00025, batch_size=32, num_steps=1, start_timesteps=50000, replay_buffer_size=1000000, learner_update_frequency=4, target_update_frequency=10000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.01, test_epsilon=0.001, v_min=-10.0, v_max=10.0, num_atoms=51, loss_reduction_method='mean'), value_distribution_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.distributional_function.ValueDistributionFunction] = <nnabla_rl.algorithms.categorical_dqn.DefaultValueDistFunctionBuilder object>, value_distribution_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.categorical_dqn.DefaultSolverBuilder object>, replay_buffer_builder: nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.categorical_dqn.DefaultReplayBufferBuilder object>, explorer_builder: nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.categorical_dqn.DefaultExplorerBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Categorical DQN algorithm.

This class implements the Categorical DQN algorithm proposed by M. Bellemare, et al. in the paper: “A Distributional Perspective on Reinforcement Learning”. For details see: https://arxiv.org/abs/1707.06887

Parameters
compute_eval_action(**kwargs)

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters

state (np.ndarray) – state to compute the action.

Returns

Action for given state using current trained policy.

Return type

np.ndarray

property latest_iteration_state

Return the latest iteration state, which is composed of items describing the training process state. You can use this state for debugging (e.g. to plot a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns

Dictionary with items of training process state.

Return type

Dict[str, Any]

DDPG

class nnabla_rl.algorithms.ddpg.DDPGConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 0.001, batch_size: int = 100, tau: float = 0.005, start_timesteps: int = 10000, replay_buffer_size: int = 1000000, exploration_noise_sigma: float = 0.1)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for DDPG algorithm

Parameters
  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.001.

  • batch_size (int) – training batch size. Defaults to 100.

  • tau (float) – target network’s parameter update coefficient. Defaults to 0.005.

  • start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 10000.

  • replay_buffer_size (int) – capacity of the replay buffer. Defaults to 1000000.

  • exploration_noise_sigma (float) – standard deviation of gaussian exploration noise. Defaults to 0.1.
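A target network update coefficient such as tau is commonly applied as a soft (Polyak) update, where the target parameters move a small step toward the trained parameters. The sketch below assumes that convention; the function name and the flat-list parameter layout are hypothetical.

```python
def soft_update(target_params, source_params, tau=0.005):
    """Return (1 - tau) * target + tau * source for each parameter."""
    return [(1.0 - tau) * t + tau * s
            for t, s in zip(target_params, source_params)]
```

With tau=0.005 the target network tracks the trained network slowly, and tau=1.0 would copy the parameters outright.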

class nnabla_rl.algorithms.ddpg.DDPG(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.ddpg.DDPGConfig = DDPGConfig(gpu_id=-1, gamma=0.99, learning_rate=0.001, batch_size=100, tau=0.005, start_timesteps=10000, replay_buffer_size=1000000, exploration_noise_sigma=0.1), critic_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.ddpg.DefaultCriticBuilder object>, critic_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.ddpg.DefaultSolverBuilder object>, actor_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.policy.DeterministicPolicy] = <nnabla_rl.algorithms.ddpg.DefaultActorBuilder object>, actor_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.ddpg.DefaultSolverBuilder object>, replay_buffer_builder: nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.ddpg.DefaultReplayBufferBuilder object>, explorer_builder: nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.ddpg.DefaultExplorerBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Deep Deterministic Policy Gradient (DDPG) algorithm.

This class implements a modified version of the Deep Deterministic Policy Gradient (DDPG) algorithm proposed by T. P. Lillicrap, et al. in the paper: “Continuous control with deep reinforcement learning”. For details see: https://arxiv.org/abs/1509.02971. We use Gaussian noise instead of an Ornstein-Uhlenbeck process to explore the environment. The effectiveness of using Gaussian noise for DDPG is reported in the paper: “Addressing Function Approximation Error in Actor-Critic Methods”. See: https://arxiv.org/abs/1802.09477

Parameters
compute_eval_action(state)[source]

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters

state (np.ndarray) – state to compute the action.

Returns

Action for given state using current trained policy.

Return type

np.ndarray

property latest_iteration_state

Return the latest iteration state, which is composed of items describing the training process state. You can use this state for debugging (e.g. to plot a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns

Dictionary with items of training process state.

Return type

Dict[str, Any]

DDQN

class nnabla_rl.algorithms.ddqn.DDQNConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 0.00025, batch_size: int = 32, num_steps: int = 1, learner_update_frequency: float = 4, target_update_frequency: float = 10000, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.1, test_epsilon: float = 0.05, grad_clip: Optional[Tuple[float, float]] = (- 1.0, 1.0))[source]

Bases: nnabla_rl.algorithms.dqn.DQNConfig

List of configurations for Double DQN (DDQN) algorithm

Parameters
  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.00025.

  • batch_size (int) – training batch size. Defaults to 32.

  • num_steps (int) – number of steps for N-step Q targets. Defaults to 1.

  • learner_update_frequency (int) – the interval of learner update. Defaults to 4.

  • target_update_frequency (int) – the interval of target q-function update. Defaults to 10000.

  • start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 50000.

  • replay_buffer_size (int) – the capacity of replay buffer. Defaults to 1000000.

  • max_explore_steps (int) – the number of steps decaying the epsilon value. The epsilon will be decayed linearly \(\epsilon=\epsilon_{init} - step\times\frac{\epsilon_{init} - \epsilon_{final}}{max\_explore\_steps}\). Defaults to 1000000.

  • initial_epsilon (float) – the initial epsilon value for ε-greedy explorer. Defaults to 1.0.

  • final_epsilon (float) – the last epsilon value for ε-greedy explorer. Defaults to 0.1.

  • test_epsilon (float) – the epsilon value on testing. Defaults to 0.05.

  • grad_clip (Optional[Tuple[float, float]]) – Clip the gradient of final layer. Defaults to (-1.0, 1.0).

class nnabla_rl.algorithms.ddqn.DDQN(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.ddqn.DDQNConfig = DDQNConfig(gpu_id=-1, gamma=0.99, learning_rate=0.00025, batch_size=32, num_steps=1, learner_update_frequency=4, target_update_frequency=10000, start_timesteps=50000, replay_buffer_size=1000000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.1, test_epsilon=0.05, grad_clip=(-1.0, 1.0)), q_func_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.dqn.DefaultQFunctionBuilder object>, q_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.dqn.DefaultSolverBuilder object>, replay_buffer_builder: nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.dqn.DefaultReplayBufferBuilder object>)[source]

Bases: nnabla_rl.algorithms.dqn.DQN

Double DQN algorithm.

This class implements the Deep Q-Network with double q-learning (DDQN) algorithm proposed by H. van Hasselt, et al. in the paper: “Deep Reinforcement Learning with Double Q-learning”. For details see: https://arxiv.org/abs/1509.06461

Note that the default solver used in this implementation is RMSPropGraves, as in the original paper. However, in practical applications, we recommend using Adam as the optimizer of DDQN. You can replace the solver by implementing a (SolverBuilder) and passing it when instantiating the DDQN class.

Parameters
  • env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info

  • config (DDQNConfig) – the parameter for DDQN training

  • q_func_builder (ModelBuilder) – builder of q function model

  • q_solver_builder (SolverBuilder) – builder of q function solver

  • replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
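The double q-learning target can be sketched for a single transition: the trained q-function selects the next action and the target q-function evaluates it. This is an illustration of the technique from the paper, not the library's code. The function name, the list-based q-value layout and the handling of terminal transitions are assumptions.

```python
def double_dqn_target(reward, gamma, q_online_next, q_target_next,
                      terminal=False):
    """Compute r + gamma * Q_target(s', argmax_a Q(s', a))."""
    if terminal:
        # Assumed convention: no bootstrapping past a terminal state.
        return reward
    # Action selection uses the trained (online) q-function...
    best_action = max(range(len(q_online_next)),
                      key=q_online_next.__getitem__)
    # ...while its value is read from the target q-function.
    return reward + gamma * q_target_next[best_action]
```

In contrast, a plain DQN target would take the maximum of q_target_next directly, which is the source of the overestimation that double q-learning is designed to reduce.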

DQN

class nnabla_rl.algorithms.dqn.DQNConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 0.00025, batch_size: int = 32, num_steps: int = 1, learner_update_frequency: float = 4, target_update_frequency: float = 10000, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.1, test_epsilon: float = 0.05, grad_clip: Optional[Tuple[float, float]] = (- 1.0, 1.0))[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for DQN algorithm

Parameters
  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • learning_rate (float) – learning rate applied to all solvers. You can override the learning rate for an individual solver by implementing your own SolverBuilder. Defaults to 0.00025.

  • batch_size (int) – training batch size. Defaults to 32.

  • num_steps (int) – number of steps for N-step Q targets. Defaults to 1.

  • learner_update_frequency (int) – the interval of learner update. Defaults to 4.

  • target_update_frequency (int) – the interval of target q-function update. Defaults to 10000.

  • start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 50000.

  • replay_buffer_size (int) – the capacity of replay buffer. Defaults to 1000000.

  • max_explore_steps (int) – the number of steps decaying the epsilon value. The epsilon will be decayed linearly \(\epsilon=\epsilon_{init} - step\times\frac{\epsilon_{init} - \epsilon_{final}}{max\_explore\_steps}\). Defaults to 1000000.

  • initial_epsilon (float) – the initial epsilon value for ε-greedy explorer. Defaults to 1.0.

  • final_epsilon (float) – the last epsilon value for ε-greedy explorer. Defaults to 0.1.

  • test_epsilon (float) – the epsilon value on testing. Defaults to 0.05.

  • grad_clip (Optional[Tuple[float, float]]) – Clip the gradient of final layer. Defaults to (-1.0, 1.0).
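
The linear decay formula given for max_explore_steps can be sketched in plain Python as follows. This is only an illustration using the default values listed above. Holding epsilon at final_epsilon once max_explore_steps is exceeded is an assumption of this sketch, and `linear_epsilon` is a hypothetical helper, not part of nnabla_rl.

```python
def linear_epsilon(step: int,
                   initial_epsilon: float = 1.0,
                   final_epsilon: float = 0.1,
                   max_explore_steps: int = 1_000_000) -> float:
    """Linearly decay epsilon from initial_epsilon towards final_epsilon."""
    decayed = initial_epsilon - step * (initial_epsilon - final_epsilon) / max_explore_steps
    # Assumption: epsilon stays at final_epsilon after the decay period.
    return max(decayed, final_epsilon)


epsilon_at_start = linear_epsilon(0)
epsilon_halfway = linear_epsilon(500_000)
epsilon_after_decay = linear_epsilon(2_000_000)
```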

class nnabla_rl.algorithms.dqn.DQN(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.dqn.DQNConfig = DQNConfig(gpu_id=-1, gamma=0.99, learning_rate=0.00025, batch_size=32, num_steps=1, learner_update_frequency=4, target_update_frequency=10000, start_timesteps=50000, replay_buffer_size=1000000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.1, test_epsilon=0.05, grad_clip=(-1.0, 1.0)), q_func_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.dqn.DefaultQFunctionBuilder object>, q_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.dqn.DefaultSolverBuilder object>, replay_buffer_builder: nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.dqn.DefaultReplayBufferBuilder object>, explorer_builder: nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.dqn.DefaultExplorerBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

DQN algorithm.

This class implements the Deep Q-Network (DQN) algorithm proposed by V. Mnih, et al. in the paper “Human-level control through deep reinforcement learning”. For details see: https://www.nature.com/articles/nature14236

Note that the default solver used in this implementation is RMSPropGraves, as in the original paper. In practical applications, however, we recommend using Adam as the optimizer for DQN. You can replace the solver by implementing a SolverBuilder and passing it when instantiating the DQN class.

Parameters
  • env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info

  • config (DQNConfig) – the parameter for DQN training

  • q_func_builder (ModelBuilder) – builder of q function model

  • q_solver_builder (SolverBuilder) – builder of q function solver

  • replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer

  • explorer_builder (ExplorerBuilder) – builder of environment explorer
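
The num_steps option of DQNConfig refers to N-step Q targets. As a rough sketch of the idea (not nnabla_rl's implementation), the target accumulates N discounted rewards and then bootstraps from a value estimate of the state reached after N steps. The helper name `n_step_target` and its arguments are hypothetical.

```python
from typing import Sequence


def n_step_target(rewards: Sequence[float], gamma: float,
                  bootstrap_value: float, terminal: bool = False) -> float:
    """Return r_0 + gamma * r_1 + ... + gamma^(N-1) * r_(N-1) + gamma^N * bootstrap_value."""
    # No bootstrapping if the episode terminated within the N steps.
    target = 0.0 if terminal else bootstrap_value
    # Fold the rewards in from the last step back to the first.
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target


one_step = n_step_target([1.0], gamma=0.5, bootstrap_value=8.0)
three_step = n_step_target([1.0, 1.0, 1.0], gamma=0.5, bootstrap_value=8.0)
```

With a single reward this reduces to the usual one-step target, which corresponds to the default num_steps of 1.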

compute_eval_action(**kwargs)

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters

state (np.ndarray) – state to compute the action.

Returns

Action for given state using current trained policy.

Return type

np.ndarray

property latest_iteration_state

Return the latest iteration state, which is composed of items describing the state of the training process. You can use this state for debugging (e.g. plotting a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns

Dictionary with items of training process state.

Return type

Dict[str, Any]

GAIL

class nnabla_rl.algorithms.gail.GAILConfig(gpu_id: int = - 1, preprocess_state: bool = True, act_deterministic_in_eval: bool = True, discriminator_batch_size: int = 50000, discriminator_learning_rate: float = 0.01, discriminator_update_frequency: int = 1, adversary_entropy_coef: float = 0.001, policy_update_frequency: int = 1, gamma: float = 0.995, lmb: float = 0.97, pi_batch_size: int = 50000, num_steps_per_iteration: int = 50000, sigma_kl_divergence_constraint: float = 0.01, maximum_backtrack_numbers: int = 10, conjugate_gradient_damping: float = 0.1, conjugate_gradient_iterations: int = 10, vf_epochs: int = 5, vf_batch_size: int = 128, vf_learning_rate: float = 0.001)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for GAIL algorithm

Parameters
  • act_deterministic_in_eval (bool) – Act deterministically during evaluation. Defaults to True.

  • discriminator_batch_size (int) – Training batch size of the discriminator. Usually, discriminator_batch_size is the same as pi_batch_size. Defaults to 50000.

  • discriminator_learning_rate (float) – Learning rate applied to the solvers of the discriminator function. You can override the learning rate for an individual solver by implementing your own SolverBuilder. Defaults to 0.01.

  • discriminator_update_frequency (int) – Frequency (measured in the number of parameter update) of discriminator update. Defaults to 1.

  • adversary_entropy_coef (float) – Coefficient of the entropy loss in discriminator training. Defaults to 0.001.

  • policy_update_frequency (int) – Frequency (measured in the number of parameter update) of policy update. Defaults to 1.

  • gamma (float) – Discount factor of rewards. Defaults to 0.995.

  • lmb (float) – Scalar of lambda return’s computation in GAE. Defaults to 0.97. This configuration is related to bias and variance of estimated value. If it is close to 0, estimated value is low-variance but biased. If it is close to 1, estimated value is unbiased but high-variance.

  • num_steps_per_iteration (int) – Number of steps in each training iteration used to collect on-policy experiences. Increasing this value gives more precise updates of the policy and value function parameters, but increases the computation time of each iteration. Defaults to 50000.

  • pi_batch_size (int) – Training batch size of the policy. Usually, pi_batch_size is the same as num_steps_per_iteration. Defaults to 50000.

  • sigma_kl_divergence_constraint (float) – Constraint size of kl divergence between previous policy and updated policy. Defaults to 0.01.

  • maximum_backtrack_numbers (int) – Maximum backtrack numbers of linesearch. Defaults to 10.

  • conjugate_gradient_damping (float) – Damping size of conjugate gradient method. Defaults to 0.1.

  • conjugate_gradient_iterations (int) – Number of iterations of conjugate gradient method. Defaults to 10.

  • vf_epochs (int) – Number of epochs in each iteration. Defaults to 5.

  • vf_batch_size (int) – Training batch size of value function. Defaults to 128.

  • vf_learning_rate (float) – Learning rate applied to the solvers of the value function. You can override the learning rate for an individual solver by implementing your own SolverBuilder. Defaults to 0.001.

  • preprocess_state (bool) – Enable preprocessing the states in the collected experiences before feeding as training batch. Defaults to True.
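
The bias/variance trade-off described for lmb can be made concrete with a small sketch of generalized advantage estimation (GAE). This is a plain-Python illustration of the standard GAE recursion, not the code used by nnabla_rl; `compute_gae` and its arguments are hypothetical names, and the episode is assumed to end after the last reward (`last_value` of 0).

```python
from typing import List, Sequence


def compute_gae(rewards: Sequence[float], values: Sequence[float],
                last_value: float, gamma: float = 0.995,
                lmb: float = 0.97) -> List[float]:
    """Compute GAE advantages A_t = sum_k (gamma * lmb)^k * delta_(t+k)."""
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of the current and later TD errors.
        running = delta + gamma * lmb * running
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards = [1.0, 1.0]
values = [0.5, 0.5]
td_only = compute_gae(rewards, values, last_value=0.0, gamma=1.0, lmb=0.0)
monte_carlo = compute_gae(rewards, values, last_value=0.0, gamma=1.0, lmb=1.0)
```

With lmb set to 0 the advantage is just the one-step TD error, which relies heavily on the value estimate. With lmb set to 1 it becomes the full return minus the value estimate, which depends on every sampled reward.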

class nnabla_rl.algorithms.gail.GAIL(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], expert_buffer: nnabla_rl.replay_buffer.ReplayBuffer, config: nnabla_rl.algorithms.gail.GAILConfig = GAILConfig(gpu_id=-1, preprocess_state=True, act_deterministic_in_eval=True, discriminator_batch_size=50000, discriminator_learning_rate=0.01, discriminator_update_frequency=1, adversary_entropy_coef=0.001, policy_update_frequency=1, gamma=0.995, lmb=0.97, pi_batch_size=50000, num_steps_per_iteration=50000, sigma_kl_divergence_constraint=0.01, maximum_backtrack_numbers=10, conjugate_gradient_damping=0.1, conjugate_gradient_iterations=10, vf_epochs=5, vf_batch_size=128, vf_learning_rate=0.001), v_function_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.v_function.VFunction] = <nnabla_rl.algorithms.gail.DefaultVFunctionBuilder object>, v_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.gail.DefaultVFunctionSolverBuilder object>, policy_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.gail.DefaultPolicyBuilder object>, reward_function_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.reward_function.RewardFunction] = <nnabla_rl.algorithms.gail.DefaultRewardFunctionBuilder object>, reward_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.gail.DefaultRewardFunctionSolverBuilder object>, state_preprocessor_builder: Optional[nnabla_rl.builders.preprocessor_builder.PreprocessorBuilder] = <nnabla_rl.algorithms.gail.DefaultPreprocessorBuilder object>, explorer_builder: nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.gail.DefaultExplorerBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Generative Adversarial Imitation Learning implementation.

This class implements the Generative Adversarial Imitation Learning (GAIL) algorithm proposed by Jonathan Ho, et al. in the paper “Generative Adversarial Imitation Learning”. For details see: https://arxiv.org/abs/1606.03476

This algorithm only supports online training.

Parameters
compute_eval_action(**kwargs)

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters

state (np.ndarray) – state to compute the action.

Returns

Action for given state using current trained policy.

Return type

np.ndarray

property latest_iteration_state

Return the latest iteration state, which is composed of items describing the state of the training process. You can use this state for debugging (e.g. plotting a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns

Dictionary with items of training process state.

Return type

Dict[str, Any]

HER

class nnabla_rl.algorithms.her.HERConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 0.001, batch_size: int = 100, tau: float = 0.005, start_timesteps: int = 10000, replay_buffer_size: int = 1000000, exploration_noise_sigma: float = 0.1, n_cycles: int = 50, n_rollout: int = 16, n_update: int = 40, max_timesteps: int = 50, hindsight_prob: float = 0.8, action_loss_coef: float = 1.0, return_clip: Optional[Tuple[float, float]] = (- 50.0, 0.0), exploration_epsilon: float = 0.3, preprocess_state: bool = True, normalize_epsilon: float = 0.01, normalize_clip_range: Optional[Tuple[float, float]] = (- 5.0, 5.0), observation_clip_range: Optional[Tuple[float, float]] = (- 200.0, 200.0))[source]

Bases: nnabla_rl.algorithms.ddpg.DDPGConfig

List of configurations for HER algorithm

Parameters
  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • learning_rate (float) – learning rate applied to all solvers. You can override the learning rate for an individual solver by implementing your own SolverBuilder. Defaults to 0.001.

  • batch_size (int) – training batch size. Defaults to 100.

  • tau (float) – target network’s parameter update coefficient. Defaults to 0.005.

  • start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 10000.

  • replay_buffer_size (int) – capacity of the replay buffer. Defaults to 1000000.

  • exploration_noise_sigma (float) – standard deviation of gaussian exploration noise. Defaults to 0.1.

  • n_cycles (int) – the number of cycles. A cycle consists of collecting experiences for several episodes and then updating the model several times. Defaults to 50.

  • n_rollout (int) – the number of episodes in which the policy collects experiences. Defaults to 16.

  • n_update (int) – the number of model updates. Defaults to 40.

  • max_timesteps (int) – the timestep at which an episode ends. Defaults to 50.

  • hindsight_prob (float) – the probability with which the buffer samples a hindsight goal. Defaults to 0.8.

  • action_loss_coef (float) – the coefficient of the action loss in the policy trainer. Defaults to 1.0.

  • return_clip (Optional[Tuple[float, float]]) – the clipping range of the return value. Defaults to (-50.0, 0.0).

  • exploration_epsilon (float) – the epsilon value for the ε-greedy explorer. Defaults to 0.3.

  • preprocess_state (bool) – Enable preprocessing the states in the collected experiences before feeding as training batch. Defaults to True.

  • normalize_epsilon (float) – the minimum value of the standard deviation of the preprocessed state. Defaults to 0.01.

  • normalize_clip_range (Optional[Tuple[float, float]]) – the clipping range of the state. Defaults to (-5.0, 5.0).

  • observation_clip_range (Optional[Tuple[float, float]]) – the clipping range of the observation. Defaults to (-200.0, 200.0).
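
To illustrate the role of hindsight_prob, here is a minimal plain-Python sketch of goal relabeling. It assumes a "future"-style strategy in which the hindsight goal is a goal achieved at the same or a later step of the same episode. Whether nnabla_rl's hindsight replay buffer samples goals exactly this way is not stated here, so treat both the strategy and the hypothetical helper `sample_goal` as assumptions.

```python
import random
from typing import Optional, Sequence, TypeVar

Goal = TypeVar("Goal")


def sample_goal(achieved_goals: Sequence[Goal], index: int, original_goal: Goal,
                hindsight_prob: float = 0.8,
                rng: Optional[random.Random] = None) -> Goal:
    """Choose the goal used for the transition at `index` of one episode."""
    rng = rng if rng is not None else random.Random()
    if rng.random() < hindsight_prob:
        # Assumed "future"-style strategy: reuse a goal that was actually
        # achieved at this step or later in the same episode.
        future_index = rng.randint(index, len(achieved_goals) - 1)
        return achieved_goals[future_index]
    # Otherwise keep the goal the agent was originally pursuing.
    return original_goal


episode_goals = ["g0", "g1", "g2", "g3"]
never_relabeled = sample_goal(episode_goals, 2, "original", hindsight_prob=0.0,
                              rng=random.Random(0))
```

A larger hindsight_prob means more sampled transitions are treated as if the agent had been aiming for a goal it actually reached.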

class nnabla_rl.algorithms.her.HER(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.her.HERConfig = HERConfig(gpu_id=-1, gamma=0.99, learning_rate=0.001, batch_size=100, tau=0.005, start_timesteps=10000, replay_buffer_size=1000000, exploration_noise_sigma=0.1, n_cycles=50, n_rollout=16, n_update=40, max_timesteps=50, hindsight_prob=0.8, action_loss_coef=1.0, return_clip=(-50.0, 0.0), exploration_epsilon=0.3, preprocess_state=True, normalize_epsilon=0.01, normalize_clip_range=(-5.0, 5.0), observation_clip_range=(-200.0, 200.0)), critic_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.her.HERCriticBuilder object>, critic_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.her.HERSolverBuilder object>, actor_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.policy.DeterministicPolicy] = <nnabla_rl.algorithms.her.HERActorBuilder object>, actor_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.her.HERSolverBuilder object>, state_preprocessor_builder: Optional[nnabla_rl.builders.preprocessor_builder.PreprocessorBuilder] = <nnabla_rl.algorithms.her.HERPreprocessorBuilder object>, replay_buffer_builder: nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.her.HindsightReplayBufferBuilder object>)[source]

Bases: nnabla_rl.algorithms.ddpg.DDPG

Hindsight Experience Replay (HER) algorithm implementation.

This class implements the Hindsight Experience Replay (HER) algorithm proposed by M. Andrychowicz, et al. in the paper “Hindsight Experience Replay”. For details see: https://arxiv.org/abs/1707.01495

This algorithm only supports online training.

Parameters

IQN

class nnabla_rl.algorithms.iqn.IQNConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 5e-05, batch_size: int = 32, num_steps: int = 1, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, learner_update_frequency: int = 4, target_update_frequency: int = 10000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.01, test_epsilon: float = 0.001, N: int = 64, N_prime: int = 64, K: int = 32, kappa: float = 1.0, embedding_dim: int = 64)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for IQN algorithm

Parameters
  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • learning_rate (float) – learning rate applied to all solvers. You can override the learning rate for an individual solver by implementing your own SolverBuilder. Defaults to 0.00005.

  • batch_size (int) – training batch size. Defaults to 32.

  • num_steps (int) – number of steps for N-step Q targets. Defaults to 1.

  • start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 50000.

  • replay_buffer_size (int) – the capacity of replay buffer. Defaults to 1000000.

  • learner_update_frequency (int) – the interval of learner update. Defaults to 4.

  • target_update_frequency (int) – the interval of target q-function update. Defaults to 10000.

  • max_explore_steps (int) – the number of steps decaying the epsilon value. The epsilon will be decayed linearly \(\epsilon=\epsilon_{init} - step\times\frac{\epsilon_{init} - \epsilon_{final}}{max\_explore\_steps}\). Defaults to 1000000.

  • initial_epsilon (float) – the initial epsilon value for ε-greedy explorer. Defaults to 1.0.

  • final_epsilon (float) – the last epsilon value for ε-greedy explorer. Defaults to 0.01.

  • test_epsilon (float) – the epsilon value on testing. Defaults to 0.001.

  • N (int) – Number of samples to compute the current state’s quantile values. Defaults to 64.

  • N_prime (int) – Number of samples to compute the target state’s quantile values. Defaults to 64.

  • K (int) – Number of samples to compute greedy next action. Defaults to 32.

  • kappa (float) – threshold value of quantile huber loss. Defaults to 1.0.

  • embedding_dim (int) – dimension of embedding for the sample point. Defaults to 64.
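
The kappa parameter is the threshold of the quantile Huber loss. The sketch below evaluates that loss for a single TD error and a single quantile fraction tau, following the definition in the IQN paper. It is a scalar illustration only, and `quantile_huber_loss` is a hypothetical helper rather than an nnabla_rl function.

```python
def quantile_huber_loss(td_error: float, tau: float, kappa: float = 1.0) -> float:
    """Quantile Huber loss |tau - 1{u < 0}| * L_kappa(u) / kappa for one TD error u."""
    abs_error = abs(td_error)
    if abs_error <= kappa:
        # Quadratic region of the Huber loss.
        huber = 0.5 * td_error ** 2
    else:
        # Linear region of the Huber loss.
        huber = kappa * (abs_error - 0.5 * kappa)
    # Asymmetric weight: negative errors are weighted by (1 - tau), others by tau.
    indicator = 1.0 if td_error < 0.0 else 0.0
    return abs(tau - indicator) * huber / kappa


loss_positive = quantile_huber_loss(0.5, tau=0.25)
loss_negative = quantile_huber_loss(-0.5, tau=0.25)
loss_linear = quantile_huber_loss(2.0, tau=0.25)
```

The asymmetric weighting is what makes each output converge towards the quantile at fraction tau instead of the mean.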

class nnabla_rl.algorithms.iqn.IQN(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.iqn.IQNConfig = IQNConfig(gpu_id=-1, gamma=0.99, learning_rate=5e-05, batch_size=32, num_steps=1, start_timesteps=50000, replay_buffer_size=1000000, learner_update_frequency=4, target_update_frequency=10000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.01, test_epsilon=0.001, N=64, N_prime=64, K=32, kappa=1.0, embedding_dim=64), quantile_function_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.distributional_function.StateActionQuantileFunction] = <nnabla_rl.algorithms.iqn.DefaultQuantileFunctionBuilder object>, quantile_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.iqn.DefaultSolverBuilder object>, replay_buffer_builder: nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.iqn.DefaultReplayBufferBuilder object>, explorer_builder: nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.iqn.DefaultExplorerBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Implicit Quantile Network algorithm.

This class implements the Implicit Quantile Network (IQN) algorithm proposed by W. Dabney, et al. in the paper “Implicit Quantile Networks for Distributional Reinforcement Learning”. For details see: https://arxiv.org/pdf/1806.06923.pdf

Parameters
compute_eval_action(**kwargs)

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters

state (np.ndarray) – state to compute the action.

Returns

Action for given state using current trained policy.

Return type

np.ndarray

property latest_iteration_state

Return the latest iteration state, which is composed of items describing the state of the training process. You can use this state for debugging (e.g. plotting a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns

Dictionary with items of training process state.

Return type

Dict[str, Any]

Munchausen DQN

class nnabla_rl.algorithms.munchausen_dqn.MunchausenDQNConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 5e-05, batch_size: int = 32, num_steps: int = 1, learner_update_frequency: float = 4, target_update_frequency: float = 10000, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.01, test_epsilon: float = 0.001, entropy_temperature: float = 0.03, munchausen_scaling_term: float = 0.9, clipping_value: float = - 1)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for Munchausen DQN algorithm

Parameters
  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • learning_rate (float) – learning rate applied to all solvers. You can override the learning rate for an individual solver by implementing your own SolverBuilder. Defaults to 0.00005.

  • batch_size (int) – training batch size. Defaults to 32.

  • num_steps (int) – number of steps for N-step Q targets. Defaults to 1.

  • learner_update_frequency (int) – the interval of learner update. Defaults to 4.

  • target_update_frequency (int) – the interval of target q-function update. Defaults to 10000.

  • start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 50000.

  • replay_buffer_size (int) – the capacity of replay buffer. Defaults to 1000000.

  • max_explore_steps (int) – the number of steps decaying the epsilon value. The epsilon will be decayed linearly \(\epsilon=\epsilon_{init} - step\times\frac{\epsilon_{init} - \epsilon_{final}}{max\_explore\_steps}\). Defaults to 1000000.

  • initial_epsilon (float) – the initial epsilon value for ε-greedy explorer. Defaults to 1.0.

  • final_epsilon (float) – the last epsilon value for ε-greedy explorer. Defaults to 0.01.

  • test_epsilon (float) – the epsilon value on testing. Defaults to 0.001.

  • entropy_temperature (float) – temperature parameter of softmax policy distribution. Defaults to 0.03.

  • munchausen_scaling_term (float) – scalar of scaled log policy. Defaults to 0.9.

  • clipping_value (float) – Lower value of the logarithm of policy distribution. Defaults to -1.
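
The last three parameters interact in the Munchausen reward bonus. Following the Munchausen RL paper, the bonus added to the reward is the scaling term times the clipped, temperature-scaled log-probability of the taken action under a softmax policy over the Q-values. The plain-Python sketch below illustrates this for one state; the helper names are hypothetical and it is not nnabla_rl's implementation.

```python
import math
from typing import List, Sequence


def log_softmax_policy(q_values: Sequence[float],
                       entropy_temperature: float = 0.03) -> List[float]:
    """Log-probabilities of the softmax policy pi = softmax(q / temperature)."""
    scaled = [q / entropy_temperature for q in q_values]
    max_scaled = max(scaled)
    # Subtract the maximum before exponentiating for numerical stability.
    log_norm = max_scaled + math.log(sum(math.exp(s - max_scaled) for s in scaled))
    return [s - log_norm for s in scaled]


def munchausen_bonus(q_values: Sequence[float], action: int,
                     entropy_temperature: float = 0.03,
                     munchausen_scaling_term: float = 0.9,
                     clipping_value: float = -1.0) -> float:
    """Return scaling_term * clip(temperature * log pi(a|s), clipping_value, 0)."""
    log_pi = log_softmax_policy(q_values, entropy_temperature)[action]
    scaled_log_pi = entropy_temperature * log_pi
    clipped = min(max(scaled_log_pi, clipping_value), 0.0)
    return munchausen_scaling_term * clipped


bonus_uniform = munchausen_bonus([1.0, 1.0], action=0)
bonus_clipped = munchausen_bonus([0.0, 100.0], action=0)
```

The lower clip keeps the bonus bounded when the taken action has a probability close to zero, which would otherwise make the log term arbitrarily negative.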

class nnabla_rl.algorithms.munchausen_dqn.MunchausenDQN(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.munchausen_dqn.MunchausenDQNConfig = MunchausenDQNConfig(gpu_id=-1, gamma=0.99, learning_rate=5e-05, batch_size=32, num_steps=1, learner_update_frequency=4, target_update_frequency=10000, start_timesteps=50000, replay_buffer_size=1000000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.01, test_epsilon=0.001, entropy_temperature=0.03, munchausen_scaling_term=0.9, clipping_value=-1), q_func_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.munchausen_dqn.DefaultQFunctionBuilder object>, q_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.munchausen_dqn.DefaultQSolverBuilder object>, replay_buffer_builder: nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.munchausen_dqn.DefaultReplayBufferBuilder object>, explorer_builder: nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.munchausen_dqn.DefaultExplorerBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Munchausen-DQN algorithm.

This class implements the Munchausen-DQN (Munchausen Deep Q Network) algorithm proposed by N. Vieillard, et al. in the paper “Munchausen Reinforcement Learning”. For details see: https://proceedings.neurips.cc/paper/2020/file/2c6a0bae0f071cbbf0bb3d5b11d90a82-Paper.pdf

Parameters
compute_eval_action(**kwargs)

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters

state (np.ndarray) – state to compute the action.

Returns

Action for given state using current trained policy.

Return type

np.ndarray

property latest_iteration_state

Return the latest iteration state, which is composed of items describing the state of the training process. You can use this state for debugging (e.g. plotting a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns

Dictionary with items of training process state.

Return type

Dict[str, Any]

Munchausen IQN

class nnabla_rl.algorithms.munchausen_iqn.MunchausenIQNConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 5e-05, batch_size: int = 32, num_steps: int = 1, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, learner_update_frequency: int = 4, target_update_frequency: int = 10000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.01, test_epsilon: float = 0.001, N: int = 64, N_prime: int = 64, K: int = 32, kappa: float = 1.0, embedding_dim: int = 64, entropy_temperature: float = 0.03, munchausen_scaling_term: float = 0.9, clipping_value: float = - 1)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for Munchausen IQN algorithm

Parameters
  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • learning_rate (float) – learning rate applied to all solvers. You can override the learning rate for an individual solver by implementing your own SolverBuilder. Defaults to 0.00005.

  • batch_size (int) – training batch size. Defaults to 32.

  • num_steps (int) – number of steps for N-step Q targets. Defaults to 1.

  • start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 50000.

  • replay_buffer_size (int) – the capacity of replay buffer. Defaults to 1000000.

  • learner_update_frequency (int) – the interval of learner update. Defaults to 4.

  • target_update_frequency (int) – the interval of target q-function update. Defaults to 10000.

  • max_explore_steps (int) – the number of steps decaying the epsilon value. The epsilon will be decayed linearly \(\epsilon=\epsilon_{init} - step\times\frac{\epsilon_{init} - \epsilon_{final}}{max\_explore\_steps}\). Defaults to 1000000.

  • initial_epsilon (float) – the initial epsilon value for ε-greedy explorer. Defaults to 1.0.

  • final_epsilon (float) – the last epsilon value for ε-greedy explorer. Defaults to 0.01.

  • test_epsilon (float) – the epsilon value on testing. Defaults to 0.001.

  • N (int) – Number of samples to compute the current state’s quantile values. Defaults to 64.

  • N_prime (int) – Number of samples to compute the target state’s quantile values. Defaults to 64.

  • K (int) – Number of samples to compute greedy next action. Defaults to 32.

  • kappa (float) – threshold value of quantile huber loss. Defaults to 1.0.

  • embedding_dim (int) – dimension of embedding for the sample point. Defaults to 64.

  • entropy_temperature (float) – temperature parameter of softmax policy distribution. Defaults to 0.03.

  • munchausen_scaling_term (float) – scalar of scaled log policy. Defaults to 0.9.

  • clipping_value (float) – Lower value of the logarithm of policy distribution. Defaults to -1.

class nnabla_rl.algorithms.munchausen_iqn.MunchausenIQN(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.munchausen_iqn.MunchausenIQNConfig = MunchausenIQNConfig(gpu_id=-1, gamma=0.99, learning_rate=5e-05, batch_size=32, num_steps=1, start_timesteps=50000, replay_buffer_size=1000000, learner_update_frequency=4, target_update_frequency=10000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.01, test_epsilon=0.001, N=64, N_prime=64, K=32, kappa=1.0, embedding_dim=64, entropy_temperature=0.03, munchausen_scaling_term=0.9, clipping_value=-1), risk_measure_function: Callable[[nnabla._variable.Variable], nnabla._variable.Variable] = <function risk_neutral_measure>, quantile_function_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.distributional_function.StateActionQuantileFunction] = <nnabla_rl.algorithms.munchausen_iqn.DefaultQuantileFunctionBuilder object>, quantile_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.munchausen_iqn.DefaultQuantileSolverBuilder object>, replay_buffer_builder: nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.munchausen_iqn.DefaultReplayBufferBuilder object>, explorer_builder: nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.munchausen_iqn.DefaultExplorerBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Munchausen-IQN algorithm implementation.

This class implements the Munchausen-IQN (Munchausen Implicit Quantile Network) algorithm proposed by N. Vieillard, et al. in the paper “Munchausen Reinforcement Learning”. For details see: https://proceedings.neurips.cc/paper/2020/file/2c6a0bae0f071cbbf0bb3d5b11d90a82-Paper.pdf

Parameters
  • env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info

  • config (MunchausenIQNConfig) – configuration of MunchausenIQN algorithm

  • risk_measure_function (Callable[[nn.Variable], nn.Variable]) – risk measure function to apply to the quantiles.

  • quantile_function_builder (ModelBuilder[StateActionQuantileFunction]) – builder of state-action quantile function models

  • quantile_solver_builder (SolverBuilder) – builder for state action quantile function solvers

  • replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer

  • explorer_builder (ExplorerBuilder) – builder of environment explorer

compute_eval_action(**kwargs)

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters

state (np.ndarray) – state to compute the action.

Returns

Action for given state using current trained policy.

Return type

np.ndarray

property latest_iteration_state

Return the latest iteration state, which is composed of items describing the state of the training process. You can use this state for debugging (e.g. plotting a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns

Dictionary with items of training process state.

Return type

Dict[str, Any]

PPO

class nnabla_rl.algorithms.ppo.PPOConfig(gpu_id: int = - 1, epsilon: float = 0.1, gamma: float = 0.99, learning_rate: float = 0.00025, lmb: float = 0.95, entropy_coefficient: float = 0.01, value_coefficient: float = 1.0, actor_num: int = 8, epochs: int = 3, batch_size: int = 256, actor_timesteps: int = 128, total_timesteps: int = 10000, decrease_alpha: bool = True, timelimit_as_terminal: bool = False, seed: int = 1, preprocess_state: bool = True)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for PPO algorithm

Parameters
  • epsilon (float) – PPO’s probability ratio clipping range. Defaults to 0.1.

  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • learning_rate (float) – learning rate applied to all solvers. You can override the learning rate for an individual solver by implementing your own SolverBuilder. Defaults to 0.00025.

  • batch_size (int) – training batch size. Defaults to 256.

  • lmb (float) – scalar of lambda return’s computation in GAE. Defaults to 0.95.

  • entropy_coefficient (float) – scalar of entropy regularization term. Defaults to 0.01.

  • value_coefficient (float) – scalar of value loss. Defaults to 1.0.

  • actor_num (int) – Number of parallel actors. Defaults to 8.

  • epochs (int) – Number of epochs to perform in each training iteration. Defaults to 3.

  • actor_timesteps (int) – Number of timesteps to interact with the environment by the actors. Defaults to 128.

  • total_timesteps (int) – Total number of timesteps to interact with the environment. Defaults to 10000.

  • decrease_alpha (bool) – Flag to control whether to decrease the learning rate linearly during the training. Defaults to True.

  • timelimit_as_terminal (bool) – Treat the episode as done when the environment reaches its time limit. Defaults to False.

  • seed (int) – base seed of random number generator used by the actors. Defaults to 1.

  • preprocess_state (bool) – Enable preprocessing the states in the collected experiences before feeding as training batch. Defaults to True.
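
The epsilon parameter bounds how far the probability ratio between the new and the old policy may move in a single update. A minimal plain-Python sketch of PPO's clipped surrogate objective for one sample is shown below. The function name `clipped_surrogate` is hypothetical and the sketch is not nnabla_rl's implementation.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.1) -> float:
    """PPO objective for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    # Restrict the probability ratio to [1 - epsilon, 1 + epsilon].
    clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    # Take the pessimistic (smaller) of the unclipped and clipped objectives.
    return min(ratio * advantage, clipped_ratio * advantage)


gain_capped = clipped_surrogate(ratio=1.5, advantage=2.0)
penalty_kept = clipped_surrogate(ratio=1.5, advantage=-2.0)
inside_range = clipped_surrogate(ratio=1.05, advantage=1.0)
```

Taking the minimum means that clipping only removes the incentive to move the ratio further in the direction that improves the objective. A change that makes the objective worse is still counted in full.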

class nnabla_rl.algorithms.ppo.PPO(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.ppo.PPOConfig = PPOConfig(gpu_id=-1, epsilon=0.1, gamma=0.99, learning_rate=0.00025, lmb=0.95, entropy_coefficient=0.01, value_coefficient=1.0, actor_num=8, epochs=3, batch_size=256, actor_timesteps=128, total_timesteps=10000, decrease_alpha=True, timelimit_as_terminal=False, seed=1, preprocess_state=True), v_function_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.v_function.VFunction] = <nnabla_rl.algorithms.ppo.DefaultVFunctionBuilder object>, v_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.ppo.DefaultSolverBuilder object>, policy_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.ppo.DefaultPolicyBuilder object>, policy_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.ppo.DefaultSolverBuilder object>, state_preprocessor_builder: Optional[nnabla_rl.builders.preprocessor_builder.PreprocessorBuilder] = <nnabla_rl.algorithms.ppo.DefaultPreprocessorBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Proximal Policy Optimization (PPO) algorithm implementation.

This class implements the Proximal Policy Optimization (PPO) algorithm proposed by J. Schulman, et al. in the paper: “Proximal Policy Optimization Algorithms”. For details see: https://arxiv.org/abs/1707.06347

This algorithm only supports online training.

Parameters
compute_eval_action(**kwargs)

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters

state (np.ndarray) – state to compute the action.

Returns

Action for given state using current trained policy.

Return type

np.ndarray

property latest_iteration_state

Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for getting more details.

Returns

Dictionary with items of training process state.

Return type

Dict[str, Any]

QRDQN

class nnabla_rl.algorithms.qrdqn.QRDQNConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 5e-05, batch_size: int = 32, num_steps: int = 1, learner_update_frequency: int = 4, target_update_frequency: int = 10000, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.01, test_epsilon: float = 0.001, num_quantiles: int = 200, kappa: float = 1.0)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for QRDQN algorithm

Parameters
  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.00005.

  • batch_size (int) – training batch size. Defaults to 32.

  • num_steps (int) – number of steps for N-step Q targets. Defaults to 1.

  • learner_update_frequency (int) – the interval of learner update. Defaults to 4.

  • target_update_frequency (int) – the interval of target q-function update. Defaults to 10000.

  • start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 50000.

  • replay_buffer_size (int) – the capacity of replay buffer. Defaults to 1000000.

  • max_explore_steps (int) – the number of steps decaying the epsilon value. The epsilon will be decayed linearly \(\epsilon=\epsilon_{init} - step\times\frac{\epsilon_{init} - \epsilon_{final}}{max\_explore\_steps}\). Defaults to 1000000.

  • initial_epsilon (float) – the initial epsilon value for ε-greedy explorer. Defaults to 1.0.

  • final_epsilon (float) – the last epsilon value for ε-greedy explorer. Defaults to 0.01.

  • test_epsilon (float) – the epsilon value on testing. Defaults to 0.001.

  • num_quantiles (int) – Number of quantile points. Defaults to 200.

  • kappa (float) – threshold value of quantile huber loss. Defaults to 1.0.
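
The linear epsilon schedule given for max_explore_steps can be sketched as follows. The function name linear_decay_epsilon is hypothetical, and holding epsilon at final_epsilon once the decay has finished is an assumption about the behavior after max_explore_steps.

```python
def linear_decay_epsilon(
    step: int,
    initial_epsilon: float = 1.0,
    final_epsilon: float = 0.01,
    max_explore_steps: int = 1_000_000,
) -> float:
    """epsilon = eps_init - step * (eps_init - eps_final) / max_explore_steps."""
    slope = (initial_epsilon - final_epsilon) / max_explore_steps
    epsilon = initial_epsilon - step * slope
    # Assumption: epsilon stays at final_epsilon after the decay period.
    return max(epsilon, final_epsilon)


# Epsilon halfway through the exploration schedule (about 0.505 with the defaults).
halfway = linear_decay_epsilon(500_000)
```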

class nnabla_rl.algorithms.qrdqn.QRDQN(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.qrdqn.QRDQNConfig = QRDQNConfig(gpu_id=-1, gamma=0.99, learning_rate=5e-05, batch_size=32, num_steps=1, learner_update_frequency=4, target_update_frequency=10000, start_timesteps=50000, replay_buffer_size=1000000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.01, test_epsilon=0.001, num_quantiles=200, kappa=1.0), quantile_dist_function_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.distributional_function.QuantileDistributionFunction] = <nnabla_rl.algorithms.qrdqn.DefaultQuantileBuilder object>, quantile_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.qrdqn.DefaultSolverBuilder object>, replay_buffer_builder: nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.qrdqn.DefaultReplayBufferBuilder object>, explorer_builder: nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.qrdqn.DefaultExplorerBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Quantile Regression DQN algorithm.

This class implements the Quantile Regression DQN algorithm proposed by W. Dabney, et al. in the paper: “Distributional Reinforcement Learning with Quantile Regression” For details see: https://arxiv.org/abs/1710.10044

Parameters
compute_eval_action(**kwargs)

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters

state (np.ndarray) – state to compute the action.

Returns

Action for given state using current trained policy.

Return type

np.ndarray

property latest_iteration_state

Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for getting more details.

Returns

Dictionary with items of training process state.

Return type

Dict[str, Any]

Rainbow

class nnabla_rl.algorithms.rainbow.RainbowConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 6.25e-05, batch_size: int = 32, num_steps: int = 3, start_timesteps: int = 20000, replay_buffer_size: int = 1000000, learner_update_frequency: int = 4, target_update_frequency: int = 8000, max_explore_steps: int = 1000000, initial_epsilon: float = 0.0, final_epsilon: float = 0.0, test_epsilon: float = 0.0, v_min: float = - 10.0, v_max: float = 10.0, num_atoms: int = 51, loss_reduction_method: str = 'mean', alpha: float = 0.5, beta: float = 0.4, betasteps: int = 12500000, warmup_random_steps: int = 0, no_double: bool = False)[source]

Bases: nnabla_rl.algorithms.categorical_ddqn.CategoricalDDQNConfig

List of configurations for Rainbow algorithm.

Parameters
  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.00025 / 4.

  • batch_size (int) – training batch size. Defaults to 32.

  • start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 20000.

  • replay_buffer_size (int) – the capacity of replay buffer. Defaults to 1000000.

  • learner_update_frequency (int) – the interval of learner update. Defaults to 4.

  • target_update_frequency (int) – the interval of target q-function update. Defaults to 8000.

  • v_min (float) – lower limit of the value used in value distribution function. Defaults to -10.0.

  • v_max (float) – upper limit of the value used in value distribution function. Defaults to 10.0.

  • num_atoms (int) – the number of bins used in value distribution function. Defaults to 51.

  • num_steps (int) – the number of steps to look ahead in n-step Q learning. Defaults to 3.

  • alpha (float) – priority exponent (written as omega in the rainbow paper) of prioritized buffer. Defaults to 0.5.

  • beta (float) – initial value of importance sampling exponent of prioritized buffer. Defaults to 0.4.

  • betasteps (int) – importance sampling exponent increase steps. After betasteps, exponent will get to 1.0. Defaults to 12500000.

  • warmup_random_steps (int) – number of initial steps during which the trained policy is NOT used for exploration. Actions are selected randomly during these steps. Defaults to 0.

  • no_double (bool) – If True, the following standard Q-learning style target is used for the categorical q value update: \(r + \gamma\max_{a}{Q_{\text{target}}(s_{t+1}, a)}\). Defaults to False.
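
As a rough illustration of how v_min, v_max, num_atoms, beta and betasteps interact, the sketch below builds the evenly spaced support used by categorical value distributions and anneals the importance sampling exponent toward 1.0. Both helper names are hypothetical, and the linear shape of the beta schedule is an assumption; the text above only states that the exponent reaches 1.0 after betasteps.

```python
from typing import List


def build_atoms(v_min: float = -10.0, v_max: float = 10.0, num_atoms: int = 51) -> List[float]:
    """Evenly spaced support z_i = v_min + i * (v_max - v_min) / (num_atoms - 1)."""
    delta_z = (v_max - v_min) / (num_atoms - 1)
    return [v_min + i * delta_z for i in range(num_atoms)]


def anneal_beta(step: int, beta: float = 0.4, betasteps: int = 12_500_000) -> float:
    """Assumed linear increase of the importance sampling exponent up to 1.0."""
    fraction = min(step / betasteps, 1.0)
    return beta + (1.0 - beta) * fraction


atoms = build_atoms()
beta_midway = anneal_beta(6_250_000)
```

With the defaults, the support has 51 atoms spaced 0.4 apart, and the middle atom sits at a value of 0.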

class nnabla_rl.algorithms.rainbow.Rainbow(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.rainbow.RainbowConfig = RainbowConfig(gpu_id=-1, gamma=0.99, learning_rate=6.25e-05, batch_size=32, num_steps=3, start_timesteps=20000, replay_buffer_size=1000000, learner_update_frequency=4, target_update_frequency=8000, max_explore_steps=1000000, initial_epsilon=0.0, final_epsilon=0.0, test_epsilon=0.0, v_min=-10.0, v_max=10.0, num_atoms=51, loss_reduction_method='mean', alpha=0.5, beta=0.4, betasteps=12500000, warmup_random_steps=0, no_double=False), value_distribution_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.distributional_function.ValueDistributionFunction] = <nnabla_rl.algorithms.rainbow.DefaultValueDistFunctionBuilder object>, value_distribution_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.rainbow.DefaultSolverBuilder object>, replay_buffer_builder: nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.rainbow.DefaultReplayBufferBuilder object>, explorer_builder: nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.rainbow.DefaultExplorerBuilder object>)[source]

Bases: nnabla_rl.algorithms.categorical_ddqn.CategoricalDDQN

Rainbow algorithm. This class implements the Rainbow algorithm proposed by M. Hessel, et al. in the paper: “Rainbow: Combining Improvements in Deep Reinforcement Learning”. For details see: https://arxiv.org/abs/1710.02298

Parameters

REINFORCE

class nnabla_rl.algorithms.reinforce.REINFORCEConfig(gpu_id: int = - 1, reward_scale: float = 0.01, num_rollouts_per_train_iteration: int = 10, learning_rate: float = 0.001, clip_grad_norm: float = 1.0, fixed_ln_var: float = - 2.3025850929940455)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for REINFORCE algorithm

Parameters
  • reward_scale (float) – Scale of reward. Defaults to 0.01.

  • num_rollouts_per_train_iteration (int) – Number of rollouts per training iteration for collecting on-policy experiences. Increasing this value makes the policy parameter updates more precise, but the computation time of each iteration will increase. Defaults to 10.

  • learning_rate (float) – Learning rate which is set to the solvers of policy function. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.001.

  • clip_grad_norm (float) – Clip the norm of the gradient to this value. Defaults to 1.0.

  • fixed_ln_var (float) – Fixed log variance of the policy. This configuration is only valid when the environment is continuous. Defaults to -2.3025850929940455 (the natural log of 0.1).
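
The default value of fixed_ln_var in the signature is the natural log of 0.1, that is, a fixed policy variance of 0.1. The short sketch below shows the conversion between log variance and standard deviation; the helper name std_from_ln_var is hypothetical and not part of nnabla_rl.

```python
import math


def std_from_ln_var(ln_var: float) -> float:
    """Standard deviation of a Gaussian whose log variance is ln_var."""
    # var = exp(ln_var), so std = sqrt(var) = exp(0.5 * ln_var).
    return math.exp(0.5 * ln_var)


default_ln_var = math.log(0.1)  # matches the default shown in the signature
default_std = std_from_ln_var(default_ln_var)  # roughly 0.316
```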

class nnabla_rl.algorithms.reinforce.REINFORCE(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.reinforce.REINFORCEConfig = REINFORCEConfig(gpu_id=-1, reward_scale=0.01, num_rollouts_per_train_iteration=10, learning_rate=0.001, clip_grad_norm=1.0, fixed_ln_var=-2.3025850929940455), policy_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.reinforce.DefaultPolicyBuilder object>, policy_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.reinforce.DefaultSolverBuilder object>, explorer_builder: nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.reinforce.DefaultExplorerBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Episodic REINFORCE implementation.

This class implements the episodic REINFORCE algorithm proposed by Ronald J. Williams in the paper: “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning”. For details see: https://link.springer.com/content/pdf/10.1007/BF00992696.pdf

This algorithm only supports online training.

Parameters
compute_eval_action(**kwargs)

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters

state (np.ndarray) – state to compute the action.

Returns

Action for given state using current trained policy.

Return type

np.ndarray

property latest_iteration_state

Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for getting more details.

Returns

Dictionary with items of training process state.

Return type

Dict[str, Any]

SAC

class nnabla_rl.algorithms.sac.SACConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 0.00030000000000000003, batch_size: int = 256, tau: float = 0.005, environment_steps: int = 1, gradient_steps: int = 1, target_entropy: Optional[float] = None, initial_temperature: Optional[float] = None, fix_temperature: bool = False, start_timesteps: int = 10000, replay_buffer_size: int = 1000000)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for SAC algorithm

Parameters
  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.0003.

  • batch_size (int) – training batch size. Defaults to 256.

  • tau (float) – target network’s parameter update coefficient. Defaults to 0.005.

  • environment_steps (int) – Number of steps to interact with the environment on each iteration. Defaults to 1.

  • gradient_steps (int) – Number of parameter updates to perform on each iteration. Defaults to 1.

  • target_entropy (float, optional) – Target entropy value. Defaults to None.

  • initial_temperature (float, optional) – Initial value of temperature parameter. Defaults to None.

  • fix_temperature (bool) – If true the temperature parameter will not be trained. Defaults to False.

  • start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 10000.

  • replay_buffer_size (int) – capacity of the replay buffer. Defaults to 1000000.
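
The role of tau can be illustrated with the usual soft (Polyak) target update, in which each target parameter moves a small step toward the corresponding trained parameter. This is a minimal sketch on plain Python lists, assuming the standard update rule; the function name soft_update is hypothetical.

```python
from typing import List


def soft_update(target: List[float], source: List[float], tau: float = 0.005) -> List[float]:
    """Return tau * source + (1 - tau) * target, element by element."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target, source)]


# With tau = 0.005 the target moves 0.5% of the way toward the source per update.
new_target = soft_update(target=[0.0, 1.0], source=[1.0, 1.0])
```

A small tau keeps the target network slowly changing, which is the usual motivation for this kind of update.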

class nnabla_rl.algorithms.sac.SAC(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.sac.SACConfig = SACConfig(gpu_id=-1, gamma=0.99, learning_rate=0.00030000000000000003, batch_size=256, tau=0.005, environment_steps=1, gradient_steps=1, target_entropy=None, initial_temperature=None, fix_temperature=False, start_timesteps=10000, replay_buffer_size=1000000), q_function_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.sac.DefaultQFunctionBuilder object>, q_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.sac.DefaultSolverBuilder object>, policy_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.sac.DefaultPolicyBuilder object>, policy_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.sac.DefaultSolverBuilder object>, temperature_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.sac.DefaultSolverBuilder object>, replay_buffer_builder: nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.sac.DefaultReplayBufferBuilder object>, explorer_builder: nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.sac.DefaultExplorerBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Soft Actor-Critic (SAC) algorithm implementation.

This class implements the extended version of Soft Actor Critic (SAC) algorithm proposed by T. Haarnoja, et al. in the paper: “Soft Actor-Critic Algorithms and Applications”. For details see: https://arxiv.org/abs/1812.05905

This algorithm slightly differs from the implementation of Soft Actor-Critic algorithm presented also by T. Haarnoja, et al. in the following paper: https://arxiv.org/abs/1801.01290

The temperature parameter is adjusted automatically instead of providing reward scalar as a hyper parameter.

Parameters
compute_eval_action(**kwargs)

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters

state (np.ndarray) – state to compute the action.

Returns

Action for given state using current trained policy.

Return type

np.ndarray

property latest_iteration_state

Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for getting more details.

Returns

Dictionary with items of training process state.

Return type

Dict[str, Any]

SAC (ICML 2018 version)

class nnabla_rl.algorithms.icml2018_sac.ICML2018SACConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 0.00030000000000000003, batch_size: int = 256, tau: float = 0.005, environment_steps: int = 1, gradient_steps: int = 1, reward_scalar: float = 5.0, start_timesteps: int = 10000, replay_buffer_size: int = 1000000, target_update_interval: int = 1)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for ICML2018SAC algorithm.

Parameters
  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.0003.

  • batch_size (int) – training batch size. Defaults to 256.

  • tau (float) – target network’s parameter update coefficient. Defaults to 0.005.

  • environment_steps (int) – Number of steps to interact with the environment on each iteration. Defaults to 1.

  • gradient_steps (int) – Number of parameter updates to perform on each iteration. Defaults to 1.

  • reward_scalar (float) – Reward scaling factor. Obtained reward will be multiplied by this value. Defaults to 5.0.

  • start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 10000.

  • replay_buffer_size (int) – capacity of the replay buffer. Defaults to 1000000.

  • target_update_interval (int) – the interval of target v function parameter’s update. Defaults to 1.
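
A small sketch of how reward_scalar and target_update_interval could be applied. Both helper names are hypothetical, and the modulo check for the update interval is an assumption about how the interval is counted, not the library's confirmed logic.

```python
def scale_reward(reward: float, reward_scalar: float = 5.0) -> float:
    """Multiply the reward obtained from the environment by reward_scalar."""
    return reward * reward_scalar


def should_update_target(iteration: int, target_update_interval: int = 1) -> bool:
    """Assumed rule: update the target v function every target_update_interval iterations."""
    return iteration % target_update_interval == 0


scaled = scale_reward(0.2)  # 0.2 * 5.0
update_iterations = [i for i in range(1, 7) if should_update_target(i, target_update_interval=3)]
```

With the default interval of 1, the target is updated on every iteration.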

class nnabla_rl.algorithms.icml2018_sac.ICML2018SAC(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.icml2018_sac.ICML2018SACConfig = ICML2018SACConfig(gpu_id=-1, gamma=0.99, learning_rate=0.00030000000000000003, batch_size=256, tau=0.005, environment_steps=1, gradient_steps=1, reward_scalar=5.0, start_timesteps=10000, replay_buffer_size=1000000, target_update_interval=1), v_function_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.v_function.VFunction] = <nnabla_rl.algorithms.icml2018_sac.DefaultVFunctionBuilder object>, v_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.icml2018_sac.DefaultSolverBuilder object>, q_function_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.icml2018_sac.DefaultQFunctionBuilder object>, q_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.icml2018_sac.DefaultSolverBuilder object>, policy_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.icml2018_sac.DefaultPolicyBuilder object>, policy_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.icml2018_sac.DefaultSolverBuilder object>, replay_buffer_builder: nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.icml2018_sac.DefaultReplayBufferBuilder object>, explorer_builder: nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.icml2018_sac.DefaultExplorerBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Soft Actor-Critic (SAC) algorithm.

This class implements the ICML2018 version of Soft Actor Critic (SAC) algorithm proposed by T. Haarnoja, et al. in the paper: “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”. For details see: https://arxiv.org/abs/1801.01290

This implementation slightly differs from the implementation of Soft Actor-Critic algorithm presented also by T. Haarnoja, et al. in the following paper: https://arxiv.org/abs/1812.05905

You will need to scale the reward received from the environment properly to get the algorithm to work.

Parameters
compute_eval_action(**kwargs)

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters

state (np.ndarray) – state to compute the action.

Returns

Action for given state using current trained policy.

Return type

np.ndarray

property latest_iteration_state

Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for getting more details.

Returns

Dictionary with items of training process state.

Return type

Dict[str, Any]

TD3

class nnabla_rl.algorithms.td3.TD3Config(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 0.001, batch_size: int = 100, tau: float = 0.005, start_timesteps: int = 10000, replay_buffer_size: int = 1000000, d: int = 2, exploration_noise_sigma: float = 0.1, train_action_noise_sigma: float = 0.2, train_action_noise_abs: float = 0.5)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for TD3 algorithm

Parameters
  • gamma (float) – discount factor of rewards. Defaults to 0.99.

  • learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.001.

  • batch_size (int) – training batch size. Defaults to 100.

  • tau (float) – target network’s parameter update coefficient. Defaults to 0.005.

  • start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 10000.

  • replay_buffer_size (int) – capacity of the replay buffer. Defaults to 1000000.

  • d (int) – Interval of the policy update. The policy will be updated every d q-function updates. Defaults to 2.

  • exploration_noise_sigma (float) – Standard deviation of the gaussian exploration noise. Defaults to 0.1.

  • train_action_noise_sigma (float) – Standard deviation of the gaussian action noise used in the training. Defaults to 0.2.

  • train_action_noise_abs (float) – Absolute limit value of action noise used in the training. Defaults to 0.5.
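
The two noise parameters used in training and the update interval d can be sketched as below: Gaussian noise with standard deviation train_action_noise_sigma is clipped to plus or minus train_action_noise_abs, and the policy is updated once every d q-function updates. The helper names are hypothetical, and the modulo check is one possible way to count the interval.

```python
import random


def clipped_action_noise(rng: random.Random, sigma: float = 0.2, noise_abs: float = 0.5) -> float:
    """Sample Gaussian noise and clip it to [-noise_abs, noise_abs]."""
    noise = rng.gauss(0.0, sigma)
    return max(-noise_abs, min(noise, noise_abs))


def is_policy_update_step(q_update_count: int, d: int = 2) -> bool:
    """Assumed rule: update the policy on every d-th q-function update."""
    return q_update_count % d == 0


rng = random.Random(0)  # fixed seed so the sketch is reproducible
noises = [clipped_action_noise(rng) for _ in range(1000)]
policy_steps = [n for n in range(1, 7) if is_policy_update_step(n)]
```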

class nnabla_rl.algorithms.td3.TD3(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.td3.TD3Config = TD3Config(gpu_id=-1, gamma=0.99, learning_rate=0.001, batch_size=100, tau=0.005, start_timesteps=10000, replay_buffer_size=1000000, d=2, exploration_noise_sigma=0.1, train_action_noise_sigma=0.2, train_action_noise_abs=0.5), critic_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.td3.DefaultCriticBuilder object>, critic_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.td3.DefaultSolverBuilder object>, actor_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.policy.DeterministicPolicy] = <nnabla_rl.algorithms.td3.DefaultActorBuilder object>, actor_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.td3.DefaultSolverBuilder object>, replay_buffer_builder: nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.td3.DefaultReplayBufferBuilder object>, explorer_builder: nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.td3.DefaultExplorerBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Twin Delayed Deep Deterministic policy gradient (TD3) algorithm.

This class implements the Twin Delayed Deep Deterministic policy gradient (TD3) algorithm proposed by S. Fujimoto, et al. in the paper: “Addressing Function Approximation Error in Actor-Critic Methods”. For details see: https://arxiv.org/abs/1802.09477

Parameters
compute_eval_action(**kwargs)

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters

state (np.ndarray) – state to compute the action.

Returns

Action for given state using current trained policy.

Return type

np.ndarray

property latest_iteration_state

Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for getting more details.

Returns

Dictionary with items of training process state.

Return type

Dict[str, Any]

TRPO

class nnabla_rl.algorithms.trpo.TRPOConfig(gpu_id: int = - 1, gamma: float = 0.995, lmb: float = 0.97, num_steps_per_iteration: int = 5000, pi_batch_size: int = 5000, sigma_kl_divergence_constraint: float = 0.01, maximum_backtrack_numbers: int = 10, conjugate_gradient_damping: float = 0.1, conjugate_gradient_iterations: int = 20, vf_epochs: int = 5, vf_batch_size: int = 64, vf_learning_rate: float = 0.001, preprocess_state: bool = True, gpu_batch_size: Optional[int] = None)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for TRPO algorithm

Parameters
  • gamma (float) – Discount factor of rewards. Defaults to 0.995.

  • lmb (float) – Scalar of lambda return’s computation in GAE. Defaults to 0.97. This configuration is related to bias and variance of estimated value. If it is close to 0, estimated value is low-variance but biased. If it is close to 1, estimated value is unbiased but high-variance.

  • num_steps_per_iteration (int) – Number of steps per training iteration for collecting on-policy experiences. Increasing this value makes the policy and value function updates more precise, but the computation time of each iteration will increase. Defaults to 5000.

  • pi_batch_size (int) – Training batch size of policy. Usually, pi_batch_size is the same as num_steps_per_iteration. Defaults to 5000.

  • sigma_kl_divergence_constraint (float) – Constraint size of kl divergence between previous policy and updated policy. Defaults to 0.01.

  • maximum_backtrack_numbers (int) – Maximum backtrack numbers of linesearch. Defaults to 10.

  • conjugate_gradient_damping (float) – Damping size of conjugate gradient method. Defaults to 0.1.

  • conjugate_gradient_iterations (int) – Number of iterations of conjugate gradient method. Defaults to 20.

  • vf_epochs (int) – Number of epochs in each iteration. Defaults to 5.

  • vf_batch_size (int) – Training batch size of value function. Defaults to 64.

  • vf_learning_rate (float) – Learning rate which is set to the solvers of value function. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.001.

  • preprocess_state (bool) – Enable preprocessing the states in the collected experiences before feeding as training batch. Defaults to True.

  • gpu_batch_size (int, optional) – Actual batch size to reduce one forward gpu calculation memory. As long as gpu memory size is enough, this configuration should not be specified. If not specified, gpu_batch_size is the same as pi_batch_size. Defaults to None.
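
The bias and variance trade-off controlled by lmb can be seen in a minimal sketch of Generalized Advantage Estimation. It assumes a single trajectory segment with no episode termination inside it, and the function name compute_gae is hypothetical rather than an nnabla_rl API.

```python
from typing import List


def compute_gae(
    rewards: List[float],
    values: List[float],
    last_value: float,
    gamma: float = 0.995,
    lmb: float = 0.97,
) -> List[float]:
    """Advantages A_t = sum_k (gamma * lmb)^k * delta_{t+k}.

    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); last_value is V of the state
    following the final step. Assumes no terminal state inside the segment.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    # Walk backwards so each step reuses the accumulated sum of later steps.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lmb * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages


# lmb = 0 gives one-step TD errors; lmb = 1 gives Monte Carlo return minus value.
td_errors = compute_gae([1.0, 1.0], [0.5, 0.5], last_value=0.0, gamma=1.0, lmb=0.0)
mc_style = compute_gae([1.0, 1.0], [0.5, 0.5], last_value=0.0, gamma=1.0, lmb=1.0)
```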

class nnabla_rl.algorithms.trpo.TRPO(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.trpo.TRPOConfig = TRPOConfig(gpu_id=-1, gamma=0.995, lmb=0.97, num_steps_per_iteration=5000, pi_batch_size=5000, sigma_kl_divergence_constraint=0.01, maximum_backtrack_numbers=10, conjugate_gradient_damping=0.1, conjugate_gradient_iterations=20, vf_epochs=5, vf_batch_size=64, vf_learning_rate=0.001, preprocess_state=True, gpu_batch_size=None), v_function_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.v_function.VFunction] = <nnabla_rl.algorithms.trpo.DefaultVFunctionBuilder object>, v_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.trpo.DefaultSolverBuilder object>, policy_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.trpo.DefaultPolicyBuilder object>, state_preprocessor_builder: Optional[nnabla_rl.builders.preprocessor_builder.PreprocessorBuilder] = <nnabla_rl.algorithms.trpo.DefaultPreprocessorBuilder object>, explorer_builder: nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.trpo.DefaultExplorerBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Trust Region Policy Optimization method with Generalized Advantage Estimation (GAE) implementation.

This class implements the Trust Region Policy Optimization (TRPO) with Generalized Advantage Estimation (GAE) algorithm proposed by J. Schulman, et al. in the papers: “Trust Region Policy Optimization” and “High-Dimensional Continuous Control Using Generalized Advantage Estimation”. For details see: https://arxiv.org/abs/1502.05477 and https://arxiv.org/abs/1506.02438

This algorithm only supports online training.

Parameters
compute_eval_action(**kwargs)

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters

state (np.ndarray) – state to compute the action.

Returns

Action for given state using current trained policy.

Return type

np.ndarray

property latest_iteration_state

Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for getting more details.

Returns

Dictionary with items of training process state.

Return type

Dict[str, Any]

TRPO (ICML 2015 version)

class nnabla_rl.algorithms.icml2015_trpo.ICML2015TRPOConfig(gpu_id: int = - 1, gamma: float = 0.99, num_steps_per_iteration: int = 100000, batch_size: int = 100000, gpu_batch_size: Optional[int] = None, sigma_kl_divergence_constraint: float = 0.01, maximum_backtrack_numbers: int = 10, conjugate_gradient_damping: float = 0.001, conjugate_gradient_iterations: int = 10)[source]

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for ICML2015TRPO algorithm

Parameters
  • gamma (float) – Discount factor of rewards. Defaults to 0.99.

  • num_steps_per_iteration (int) – Number of steps per training iteration for collecting on-policy experiences. Increasing this value yields more precise policy and value function updates, but increases the computation time of each iteration. Defaults to 100000.

  • batch_size (int) – Training batch size of the policy. Usually, batch_size is the same as num_steps_per_iteration. Defaults to 100000.

  • gpu_batch_size (int, optional) – Actual batch size used for a single forward computation on the gpu, to reduce memory usage. Leave this unspecified as long as gpu memory is sufficient. If not specified, gpu_batch_size is the same as batch_size. Defaults to None.

  • sigma_kl_divergence_constraint (float) – Constraint size of kl divergence between previous policy and updated policy. Defaults to 0.01.

  • maximum_backtrack_numbers (int) – Maximum backtrack numbers of linesearch. Defaults to 10.

  • conjugate_gradient_damping (float) – Damping size of conjugate gradient method. Defaults to 0.001.

  • conjugate_gradient_iterations (int) – Number of iterations of conjugate gradient method. Defaults to 10.
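
The conjugate_gradient_damping and conjugate_gradient_iterations settings can be read against a textbook conjugate gradient solver. The sketch below is plain Python and is not nnabla_rl's implementation: it assumes the damping is added as a multiple of the identity to the matrix being inverted, and conjugate_gradient, matvec and tol are names invented for this example.

```python
from typing import Callable, List


def conjugate_gradient(matvec: Callable[[List[float]], List[float]],
                       b: List[float],
                       damping: float = 0.001,
                       iterations: int = 10,
                       tol: float = 1e-10) -> List[float]:
    """Approximately solve (A + damping * I) x = b, given only the map v -> A v."""
    def damped(v: List[float]) -> List[float]:
        return [av + damping * vi for av, vi in zip(matvec(v), v)]

    def dot(u: List[float], v: List[float]) -> float:
        return sum(ui * vi for ui, vi in zip(u, v))

    x = [0.0] * len(b)
    r = list(b)  # residual b - A x, with x starting at zero
    p = list(b)  # current search direction
    rr = dot(r, r)
    for _ in range(iterations):
        if rr < tol:
            break
        ap = damped(p)
        alpha = rr / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        new_rr = dot(r, r)
        p = [ri + (new_rr / rr) * pi for ri, pi in zip(r, p)]
        rr = new_rr
    return x


def example_matvec(v: List[float]) -> List[float]:
    # A = [[4, 1], [1, 3]], a small symmetric positive definite matrix.
    return [4.0 * v[0] + v[1], v[0] + 3.0 * v[1]]


solution = conjugate_gradient(example_matvec, [1.0, 2.0], damping=0.0)
```

In TRPO the matrix is the Fisher information of the policy, which is only accessed through matrix-vector products. A larger damping value trades accuracy of the solution for numerical stability.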

class nnabla_rl.algorithms.icml2015_trpo.ICML2015TRPO(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.icml2015_trpo.ICML2015TRPOConfig = ICML2015TRPOConfig(gpu_id=-1, gamma=0.99, num_steps_per_iteration=100000, batch_size=100000, gpu_batch_size=None, sigma_kl_divergence_constraint=0.01, maximum_backtrack_numbers=10, conjugate_gradient_damping=0.001, conjugate_gradient_iterations=10), policy_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.icml2015_trpo.DefaultPolicyBuilder object>, explorer_builder: nnabla_rl.builders.explorer_builder.ExplorerBuilder = <nnabla_rl.algorithms.icml2015_trpo.DefaultExplorerBuilder object>)[source]

Bases: nnabla_rl.algorithm.Algorithm

Trust Region Policy Optimization method with Single Path algorithm.

This class implements the Trust Region Policy Optimization (TRPO) with Single Path algorithm, proposed by J. Schulman, et al. in the paper “Trust Region Policy Optimization”. For details see: https://arxiv.org/abs/1502.05477

Parameters
compute_eval_action(**kwargs)

Compute action for given state using current best policy. This is usually used for evaluation.

Parameters

state (np.ndarray) – state to compute the action.

Returns

Action for given state using current trained policy.

Return type

np.ndarray

property latest_iteration_state

Return the latest iteration state, which is composed of items describing the state of the training process. You can use this state for debugging (e.g. plotting a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns

Dictionary with items of training process state.

Return type

Dict[str, Any]
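
The sigma_kl_divergence_constraint and maximum_backtrack_numbers settings of both TRPO variants relate to the line search that follows the conjugate gradient step. The sketch below shows the general idea in plain Python. The shrink factor of 0.5, the acceptance test, and all function names are assumptions made for this example, not details taken from nnabla_rl.

```python
from typing import Callable, List


def backtracking_line_search(objective: Callable[[List[float]], float],
                             kl: Callable[[List[float], List[float]], float],
                             theta: List[float],
                             full_step: List[float],
                             max_kl: float = 0.01,
                             max_backtracks: int = 10) -> List[float]:
    """Shrink the step until the KL constraint holds and the objective improves."""
    baseline = objective(theta)
    for k in range(max_backtracks):
        scale = 0.5 ** k  # assumed shrink schedule: 1, 1/2, 1/4, ...
        candidate = [t + scale * s for t, s in zip(theta, full_step)]
        if kl(theta, candidate) <= max_kl and objective(candidate) > baseline:
            return candidate
    # No acceptable step was found, so keep the current parameters.
    return theta


def toy_objective(params: List[float]) -> float:
    return -(params[0] - 1.0) ** 2


def toy_kl(old: List[float], new: List[float]) -> float:
    return 0.5 * (new[0] - old[0]) ** 2


accepted = backtracking_line_search(toy_objective, toy_kl, [0.0], [2.0])
rejected = backtracking_line_search(toy_objective, toy_kl, [0.0], [-2.0])
```

In this toy setting the full step of 2.0 is halved four times before the KL value 0.0078125 falls below the 0.01 constraint, so the accepted parameters are [0.125]. A step in the wrong direction is never accepted and the parameters are left unchanged.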

Builders

Builder Class

ExplorerBuilder

class nnabla_rl.builders.ExplorerBuilder[source]

Explorer builder interface class

build_explorer(env_info: nnabla_rl.environments.environment_info.EnvironmentInfo, algorithm_config: nnabla_rl.algorithm.AlgorithmConfig, algorithm: nnabla_rl.algorithm.Algorithm, **kwargs) nnabla_rl.environment_explorer.EnvironmentExplorer[source]

Build explorer.

Parameters
  • env_info (EnvironmentInfo) – environment information

  • algorithm_config (AlgorithmConfig) – configuration class of target algorithm. Actual type differs depending on the algorithm.

  • algorithm (Algorithm) – target algorithm. Actual type differs depending on the algorithm.

Returns

explorer instance.

Return type

EnvironmentExplorer

ModelBuilder

class nnabla_rl.builders.ModelBuilder(*args, **kwds)[source]

Model builder interface class

build_model(scope_name: str, env_info: nnabla_rl.environments.environment_info.EnvironmentInfo, algorithm_config: nnabla_rl.algorithm.AlgorithmConfig, **kwargs) nnabla_rl.builders.model_builder.T[source]

Build model.

Parameters
  • scope_name (str) – the scope name of model

  • env_info (EnvironmentInfo) – environment information

  • algorithm_config (AlgorithmConfig) – configuration class of target algorithm. Actual type differs depending on the algorithm.

Returns

model instance. The type of the model depends on the builder’s generic type.

Return type

T

PreprocessorBuilder

class nnabla_rl.builders.PreprocessorBuilder[source]

Preprocessor builder interface class

build_preprocessor(scope_name: str, env_info: nnabla_rl.environments.environment_info.EnvironmentInfo, algorithm_config: nnabla_rl.algorithm.AlgorithmConfig, **kwargs) nnabla_rl.preprocessors.preprocessor.Preprocessor[source]

Build preprocessor

Parameters
  • scope_name (str) – the scope name of model

  • env_info (EnvironmentInfo) – environment information

  • algorithm_config (AlgorithmConfig) – configuration class of target algorithm. Actual type differs depending on the algorithm.

Returns

preprocessor instance.

Return type

Preprocessor

ReplayBufferBuilder

class nnabla_rl.builders.ReplayBufferBuilder[source]

ReplayBuffer builder interface class

build_replay_buffer(env_info: nnabla_rl.environments.environment_info.EnvironmentInfo, algorithm_config: nnabla_rl.algorithm.AlgorithmConfig, **kwargs) nnabla_rl.replay_buffer.ReplayBuffer[source]

Build replay buffer

Parameters
Returns

replay buffer instance.

Return type

ReplayBuffer

SolverBuilder

class nnabla_rl.builders.SolverBuilder[source]

Solver builder interface class

build_solver(env_info: nnabla_rl.environments.environment_info.EnvironmentInfo, algorithm_config: nnabla_rl.algorithm.AlgorithmConfig, **kwargs) nnabla.solver.Solver[source]

Build solver function

Parameters
Returns

solver instance.

Return type

Solver

Distributions

All probability distributions are derived from nnabla_rl.distributions.Distribution.

Distribution

class nnabla_rl.distributions.Distribution[source]
choose_probable() nnabla._variable.Variable[source]

Compute the most probable action of the distribution

Returns

Probable action of the distribution

Return type

nnabla.Variable

entropy() nnabla._variable.Variable[source]

Compute the entropy of the distribution

Returns

Entropy of the distribution

Return type

nn.Variable

kl_divergence(q: nnabla_rl.distributions.distribution.Distribution) nnabla._variable.Variable[source]

Compute the Kullback-Leibler divergence between this distribution and the given distribution. This function will compute KL(self||q).

Parameters

q (nnabla_rl.distributions.Distribution) – target distribution to compute the kl_divergence

Returns

Kullback-Leibler divergence

Return type

nn.Variable

Raises

ValueError – target distribution’s type does not match with current distribution type.

log_prob(x: nnabla._variable.Variable) nnabla._variable.Variable[source]

Compute the log probability of given input

Parameters

x (nn.Variable) – Target value to compute the log probability

Returns

Log probability of given input

Return type

nn.Variable

mean() nnabla._variable.Variable[source]

Compute the mean of the distribution (if it exists)

Returns

mean of the distribution

Return type

nn.Variable

Raises

NotImplementedError – The distribution does not have mean

property ndim: int

The number of dimensions of the distribution

abstract sample(noise_clip: Optional[Tuple[float, float]] = None) nnabla._variable.Variable[source]

Sample a value from the distribution. If noise_clip is specified, the sampled value will be clipped in the given range. Applicability of noise_clip depends on underlying implementation.

Parameters

noise_clip (Tuple[float, float], optional) – float tuple of size 2 which contains the min and max value of the noise.

Returns

Sampled value

Return type

nn.Variable

sample_and_compute_log_prob(noise_clip: Optional[Tuple[float, float]] = None) Tuple[nnabla._variable.Variable, nnabla._variable.Variable][source]

Sample a value from the distribution and compute its log probability.

Parameters

noise_clip (Tuple[float, float], optional) – float tuple of size 2 which contains the min and max value of the noise.

Returns

Sampled value and its log probability

Return type

Tuple[nn.Variable, nn.Variable]

sample_multiple(num_samples: int, noise_clip: Optional[Tuple[float, float]] = None) nnabla._variable.Variable[source]

Sample multiple values from the distribution. A new axis will be added between the first and second axes. Therefore, for mean and variance of shape (batch_size, data_shape), the returned value will have shape (batch_size, num_samples, data_shape).

If noise_clip is specified, sampled values will be clipped in the given range. Applicability of noise_clip depends on underlying implementation.

Parameters
  • num_samples (int) – number of samples per batch

  • noise_clip (Tuple[float, float], optional) – float tuple of size 2 which contains the min and max value of the noise.

Returns

Sampled value.

Return type

nn.Variable

List of Distributions

class nnabla_rl.distributions.Gaussian(mean, ln_var)[source]

Bases: nnabla_rl.distributions.distribution.Distribution

Gaussian distribution

\(\mathcal{N}(\mu,\,\sigma^{2})\)

Parameters
  • mean (nn.Variable) – mean \(\mu\) of gaussian distribution.

  • ln_var (nn.Variable) – logarithm of the variance \(\sigma^{2}\). (i.e. ln_var is \(\log{\sigma^{2}}\))

choose_probable()[source]

Compute the most probable action of the distribution

Returns

Probable action of the distribution

Return type

nnabla.Variable

entropy()[source]

Compute the entropy of the distribution

Returns

Entropy of the distribution

Return type

nn.Variable

kl_divergence(q)[source]

Compute the Kullback-Leibler divergence between this distribution and the given distribution. This function will compute KL(self||q).

Parameters

q (nnabla_rl.distributions.Distribution) – target distribution to compute the kl_divergence

Returns

Kullback-Leibler divergence

Return type

nn.Variable

Raises

ValueError – target distribution’s type does not match with current distribution type.

log_prob(x)[source]

Compute the log probability of given input

Parameters

x (nn.Variable) – Target value to compute the log probability

Returns

Log probability of given input

Return type

nn.Variable

mean()[source]

Compute the mean of the distribution (if it exists)

Returns

mean of the distribution

Return type

nn.Variable

Raises

NotImplementedError – The distribution does not have mean

property ndim

The number of dimensions of the distribution

sample(noise_clip=None)[source]

Sample a value from the distribution. If noise_clip is specified, the sampled value will be clipped in the given range. Applicability of noise_clip depends on underlying implementation.

Parameters

noise_clip (Tuple[float, float], optional) – float tuple of size 2 which contains the min and max value of the noise.

Returns

Sampled value

Return type

nn.Variable

sample_and_compute_log_prob(noise_clip=None)[source]

Sample a value from the distribution and compute its log probability.

Parameters

noise_clip (Tuple[float, float], optional) – float tuple of size 2 which contains the min and max value of the noise.

Returns

Sampled value and its log probability

Return type

Tuple[nn.Variable, nn.Variable]

sample_multiple(num_samples, noise_clip=None)[source]

Sample multiple values from the distribution. A new axis will be added between the first and second axes. Therefore, for mean and variance of shape (batch_size, data_shape), the returned value will have shape (batch_size, num_samples, data_shape).

If noise_clip is specified, sampled values will be clipped in the given range. Applicability of noise_clip depends on underlying implementation.

Parameters
  • num_samples (int) – number of samples per batch

  • noise_clip (Tuple[float, float], optional) – float tuple of size 2 which contains the min and max value of the noise.

Returns

Sampled value.

Return type

nn.Variable
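
For reference, the closed forms behind log_prob, entropy and kl_divergence of a diagonal Gaussian parameterized by mean and ln_var can be written out in plain Python. This is a sketch of the standard formulas, not the library code. Summing over the dimensions is an assumption about how per-dimension terms are reduced, and the function names are hypothetical.

```python
import math
from typing import Sequence


def gaussian_log_prob(x: Sequence[float], mean: Sequence[float], ln_var: Sequence[float]) -> float:
    """Log density of x under N(mean, exp(ln_var)), summed over dimensions."""
    return sum(-0.5 * (math.log(2.0 * math.pi) + lv + (xi - mu) ** 2 / math.exp(lv))
               for xi, mu, lv in zip(x, mean, ln_var))


def gaussian_entropy(ln_var: Sequence[float]) -> float:
    """Differential entropy, which depends only on the log variance."""
    return sum(0.5 * (math.log(2.0 * math.pi * math.e) + lv) for lv in ln_var)


def gaussian_kl(mean_p: Sequence[float], ln_var_p: Sequence[float],
                mean_q: Sequence[float], ln_var_q: Sequence[float]) -> float:
    """KL(p || q) between two diagonal Gaussians."""
    total = 0.0
    for mp, lp, mq, lq in zip(mean_p, ln_var_p, mean_q, ln_var_q):
        total += 0.5 * (lq - lp + (math.exp(lp) + (mp - mq) ** 2) / math.exp(lq) - 1.0)
    return total


kl_unit_shift = gaussian_kl([0.0], [0.0], [1.0], [0.0])  # N(0, 1) against N(1, 1)
```

Because the distribution takes the logarithm of the variance, the standard deviation is exp(0.5 * ln_var), and a ln_var of 0 corresponds to unit variance.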

class nnabla_rl.distributions.Softmax(z)[source]

Bases: nnabla_rl.distributions.distribution.Distribution

Softmax distribution which samples a class index \(i\) according to the following probability.

\(i \sim \frac{\exp{z_{i}}}{\sum_{j}\exp{z_{j}}}\).

Parameters

z (nn.Variable) – logits \(z\). The logits’ dimension should be the same as the number of classes to sample.

choose_probable()[source]

Compute the most probable action of the distribution

Returns

Probable action of the distribution

Return type

nnabla.Variable

entropy()[source]

Compute the entropy of the distribution

Returns

Entropy of the distribution

Return type

nn.Variable

kl_divergence(q)[source]

Compute the Kullback-Leibler divergence between this distribution and the given distribution. This function will compute KL(self||q).

Parameters

q (nnabla_rl.distributions.Distribution) – target distribution to compute the kl_divergence

Returns

Kullback-Leibler divergence

Return type

nn.Variable

Raises

ValueError – target distribution’s type does not match with current distribution type.

log_prob(x)[source]

Compute the log probability of given input

Parameters

x (nn.Variable) – Target value to compute the log probability

Returns

Log probability of given input

Return type

nn.Variable

mean()[source]

Compute the mean of the distribution (if it exists)

Returns

mean of the distribution

Return type

nn.Variable

Raises

NotImplementedError – The distribution does not have mean

property ndim

The number of dimensions of the distribution

sample(noise_clip=None)[source]

Sample a value from the distribution. If noise_clip is specified, the sampled value will be clipped in the given range. Applicability of noise_clip depends on underlying implementation.

Parameters

noise_clip (Tuple[float, float], optional) – float tuple of size 2 which contains the min and max value of the noise.

Returns

Sampled value

Return type

nn.Variable

sample_and_compute_log_prob(noise_clip=None)[source]

Sample a value from the distribution and compute its log probability.

Parameters

noise_clip (Tuple[float, float], optional) – float tuple of size 2 which contains the min and max value of the noise.

Returns

Sampled value and its log probability

Return type

Tuple[nn.Variable, nn.Variable]

sample_multiple(num_samples, noise_clip=None)[source]

Sample multiple values from the distribution. A new axis will be added between the first and second axes. Therefore, for mean and variance of shape (batch_size, data_shape), the returned value will have shape (batch_size, num_samples, data_shape).

If noise_clip is specified, sampled values will be clipped in the given range. Applicability of noise_clip depends on underlying implementation.

Parameters
  • num_samples (int) – number of samples per batch

  • noise_clip (Tuple[float, float], optional) – float tuple of size 2 which contains the min and max value of the noise.

Returns

Sampled value.

Return type

nn.Variable
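
The sampling rule of the Softmax distribution can be illustrated without nnabla. The sketch below works on a single list of logits. softmax, sample_index, most_probable_index and softmax_entropy are hypothetical helper names for this example and do not mirror the library's internals.

```python
import math
import random
from typing import List, Sequence


def softmax(z: Sequence[float]) -> List[float]:
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(z: Sequence[float], rng: random.Random) -> int:
    """Draw a class index i with probability exp(z_i) / sum_j exp(z_j)."""
    return rng.choices(range(len(z)), weights=softmax(z), k=1)[0]


def most_probable_index(z: Sequence[float]) -> int:
    """The most probable class is the one with the largest logit."""
    return max(range(len(z)), key=lambda i: z[i])


def softmax_entropy(z: Sequence[float]) -> float:
    return -sum(p * math.log(p) for p in softmax(z) if p > 0.0)


logits = [0.1, 2.0, -1.0]
sampled = sample_index(logits, random.Random(0))
greedy = most_probable_index(logits)
```

Two equal logits give probabilities of 0.5 each and an entropy of ln 2, which is the maximum for two classes.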

class nnabla_rl.distributions.SquashedGaussian(mean, ln_var)[source]

Bases: nnabla_rl.distributions.distribution.Distribution

Gaussian distribution whose output is squashed with tanh.

\(z \sim \mathcal{N}(\mu,\,\sigma^{2})\). \(out = \tanh{z}\).

Parameters
  • mean (nn.Variable) – mean \(\mu\) of underlying gaussian distribution.

  • ln_var (nn.Variable) – logarithm of the variance \(\sigma^{2}\). (i.e. ln_var is \(\log{\sigma^{2}}\))

Note

The log probability and kl_divergence of this distribution are different from those of the Gaussian distribution because the output is squashed.

choose_probable()[source]

Compute the most probable action of the distribution

Returns

Probable action of the distribution

Return type

nnabla.Variable

log_prob(x)[source]

Compute the log probability of given input

Parameters

x (nn.Variable) – Target value to compute the log probability

Returns

Log probability of given input

Return type

nn.Variable

property ndim

The number of dimensions of the distribution

sample(noise_clip=None)[source]

Sample a value from the distribution. If noise_clip is specified, the sampled value will be clipped in the given range. Applicability of noise_clip depends on underlying implementation.

Parameters

noise_clip (Tuple[float, float], optional) – float tuple of size 2 which contains the min and max value of the noise.

Returns

Sampled value

Return type

nn.Variable

sample_and_compute_log_prob(noise_clip=None)[source]

NOTE: In order to avoid sampling different random values for sample and log_prob, you’ll need to use nnabla.forward_all([sample, log_prob]). If you forward the two variables independently, you’ll get a log_prob for a different sample, since different random variables are sampled internally.

sample_multiple(num_samples, noise_clip=None)[source]

Sample multiple values from the distribution. A new axis will be added between the first and second axes. Therefore, for mean and variance of shape (batch_size, data_shape), the returned value will have shape (batch_size, num_samples, data_shape).

If noise_clip is specified, sampled values will be clipped in the given range. Applicability of noise_clip depends on underlying implementation.

Parameters
  • num_samples (int) – number of samples per batch

  • noise_clip (Tuple[float, float], optional) – float tuple of size 2 which contains the min and max value of the noise.

Returns

Sampled value.

Return type

nn.Variable

sample_multiple_and_compute_log_prob(num_samples, noise_clip=None)[source]

NOTE: In order to avoid sampling different random values for sample and log_prob, you’ll need to use nnabla.forward_all([sample, log_prob]). If you forward the two variables independently, you’ll get a log_prob for a different sample, since different random variables are sampled internally.
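
To make concrete why the sample and its log probability must come from the same random draw, the sketch below computes both from one shared noise vector in plain Python. It uses the standard change of variables for tanh. The small eps added inside the logarithm and the function name are assumptions made for this example, not details of nnabla_rl.

```python
import math
import random
from typing import List, Sequence, Tuple


def squashed_sample_and_log_prob(mean: Sequence[float],
                                 ln_var: Sequence[float],
                                 noise: Sequence[float],
                                 eps: float = 1e-8) -> Tuple[List[float], float]:
    """Return (tanh(z), log density of tanh(z)) for z = mean + std * noise."""
    action = []
    log_prob = 0.0
    for mu, lv, n in zip(mean, ln_var, noise):
        z = mu + math.exp(0.5 * lv) * n
        a = math.tanh(z)
        # Log density of z under N(mu, exp(lv)); note (z - mu)^2 / var == n^2.
        gaussian_term = -0.5 * (math.log(2.0 * math.pi) + lv + n * n)
        # Change of variables for a = tanh(z): subtract log(1 - tanh(z)^2).
        log_prob += gaussian_term - math.log(1.0 - a * a + eps)
        action.append(a)
    return action, log_prob


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(2)]  # drawn once, used for both outputs
action, log_prob = squashed_sample_and_log_prob([0.0, 0.5], [0.0, -1.0], noise)
```

If the noise were drawn a second time to compute the log probability, the result would describe a different action from the one returned, which is the situation the note above warns about.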

Environment explorers

All explorers are derived from nnabla_rl.environment_explorer.EnvironmentExplorer.

EnvironmentExplorer

class nnabla_rl.environment_explorer.EnvironmentExplorerConfig(warmup_random_steps: int = 0, reward_scalar: float = 1.0, timelimit_as_terminal: bool = True, initial_step_num: int = 0)[source]
class nnabla_rl.environment_explorer.EnvironmentExplorer(env_info: nnabla_rl.environments.environment_info.EnvironmentInfo, config: nnabla_rl.environment_explorer.EnvironmentExplorerConfig = EnvironmentExplorerConfig(warmup_random_steps=0, reward_scalar=1.0, timelimit_as_terminal=True, initial_step_num=0))[source]

Base class for environment exploration methods.

abstract action(steps: int, state: numpy.ndarray) numpy.ndarray[source]

Compute the action for given state at given timestep

Parameters
  • steps (int) – timesteps since the beginning of exploration

  • state (np.ndarray) – current state to compute the action

Returns

action for current state at given timestep

Return type

np.ndarray

rollout(env: gym.core.Env) List[Tuple[Union[numpy.ndarray, Tuple[numpy.ndarray, ...]], numpy.ndarray, float, float, Union[numpy.ndarray, Tuple[numpy.ndarray, ...]], Dict[str, Any]]][source]

Rollout the episode in current env

Parameters

env (gym.Env) – Environment

Returns

List of experience.

Experience consists of (state, action, reward, terminal flag, next state and extra info).

Return type

List[Experience]

step(env: gym.core.Env, n: int = 1, break_if_done: bool = False) List[Tuple[Union[numpy.ndarray, Tuple[numpy.ndarray, ...]], numpy.ndarray, float, float, Union[numpy.ndarray, Tuple[numpy.ndarray, ...]], Dict[str, Any]]][source]

Step n timesteps in given env

Parameters
  • env (gym.Env) – Environment

  • n (int) – Number of timesteps to act in the environment

Returns

List of experience.

Experience consists of (state, action, reward, terminal flag, next state and extra info).

Return type

List[Experience]

LinearDecayEpsilonGreedyExplorer

class nnabla_rl.environment_explorers.LinearDecayEpsilonGreedyExplorer(greedy_action_selector: Callable[[numpy.ndarray], Tuple[numpy.ndarray, Dict]], random_action_selector: Callable[[numpy.ndarray], Tuple[numpy.ndarray, Dict]], env_info: nnabla_rl.environments.environment_info.EnvironmentInfo, config: nnabla_rl.environment_explorers.epsilon_greedy_explorer.LinearDecayEpsilonGreedyExplorerConfig = LinearDecayEpsilonGreedyExplorerConfig(warmup_random_steps=0, reward_scalar=1.0, timelimit_as_terminal=True, initial_step_num=0, initial_epsilon=1.0, final_epsilon=0.05, max_explore_steps=1000000))[source]

Linear decay epsilon-greedy explorer

Epsilon-greedy style explorer. Epsilon is linearly decayed until max_explore_steps set in the config.

Parameters
  • greedy_action_selector (Callable[[np.ndarray], Tuple[np.ndarray, Dict]]) – callable which computes greedy action with respect to current state.

  • random_action_selector (Callable[[np.ndarray], Tuple[np.ndarray, Dict]]) – callable which computes random action that can be executed in the environment.

  • env_info (EnvironmentInfo) – environment info

  • config (LinearDecayEpsilonGreedyExplorerConfig) – the config of this class.

action(step, state)[source]

Compute the action for given state at given timestep

Parameters
  • steps (int) – timesteps since the beginning of exploration

  • state (np.ndarray) – current state to compute the action

Returns

action for current state at given timestep

Return type

np.ndarray
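
The decay schedule implied by initial_epsilon, final_epsilon and max_explore_steps can be sketched as follows. This is an illustrative plain-Python version that assumes epsilon is held at final_epsilon once max_explore_steps is reached. linear_decay_epsilon and select_action are hypothetical names, and the selectors here return bare actions rather than the (action, info) tuples used by the library.

```python
import random
from typing import Any, Callable


def linear_decay_epsilon(step: int,
                         initial_epsilon: float = 1.0,
                         final_epsilon: float = 0.05,
                         max_explore_steps: int = 1000000) -> float:
    """Interpolate linearly from initial_epsilon to final_epsilon, then stay flat."""
    fraction = min(max(step, 0) / max_explore_steps, 1.0)
    return initial_epsilon + (final_epsilon - initial_epsilon) * fraction


def select_action(step: int,
                  state: Any,
                  greedy_selector: Callable[[Any], Any],
                  random_selector: Callable[[Any], Any],
                  rng: random.Random) -> Any:
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < linear_decay_epsilon(step):
        return random_selector(state)
    return greedy_selector(state)


epsilon_halfway = linear_decay_epsilon(500000)
first_action = select_action(0, None, lambda s: "greedy", lambda s: "random", random.Random(0))
```

At step 0 epsilon is 1.0, so the random selector is always used. Halfway through the schedule epsilon is about 0.525 with the default settings.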

GaussianExplorer

class nnabla_rl.environment_explorers.GaussianExplorer(policy_action_selector: Callable[[numpy.ndarray], Tuple[numpy.ndarray, Dict]], env_info: nnabla_rl.environments.environment_info.EnvironmentInfo, config: nnabla_rl.environment_explorers.gaussian_explorer.GaussianExplorerConfig = GaussianExplorerConfig(warmup_random_steps=0, reward_scalar=1.0, timelimit_as_terminal=True, initial_step_num=0, action_clip_low=2.2250738585072014e-308, action_clip_high=1.7976931348623157e+308, sigma=1.0))[source]

Gaussian explorer

Explore using the policy’s action with gaussian noise appended to it. The policy’s action must be continuous.

Parameters
  • policy_action_selector (Callable[[np.ndarray], Tuple[np.ndarray, Dict]]) – callable which computes current policy’s action with respect to current state.

  • env_info (EnvironmentInfo) – environment info

  • config (GaussianExplorerConfig) – the config of this class.

action(step, state)[source]

Compute the action for given state at given timestep

Parameters
  • steps (int) – timesteps since the beginning of exploration

  • state (np.ndarray) – current state to compute the action

Returns

action for current state at given timestep

Return type

np.ndarray
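
A minimal sketch of the exploration rule, assuming that noise with standard deviation sigma is added to each action element and the result is then clipped to [action_clip_low, action_clip_high]. The order of operations and the function name are assumptions made for this example.

```python
import random
import sys
from typing import List, Sequence


def gaussian_explore(policy_action: Sequence[float],
                     sigma: float,
                     clip_low: float,
                     clip_high: float,
                     rng: random.Random) -> List[float]:
    """Add zero-mean Gaussian noise to each action element, then clip."""
    noisy = [a + rng.gauss(0.0, sigma) for a in policy_action]
    return [min(max(a, clip_low), clip_high) for a in noisy]


rng = random.Random(0)
action = gaussian_explore([0.3, -0.8], sigma=0.1, clip_low=-1.0, clip_high=1.0, rng=rng)

# The default action_clip_low printed in the signature equals sys.float_info.min,
# which is the smallest positive normalized double, not the most negative one.
smallest_positive = sys.float_info.min
```

Since the default lower bound shown in the signature is a tiny positive number, it is probably safer to set action_clip_low and action_clip_high explicitly to the bounds of the environment's action space, for example from env_info.action_low and env_info.action_high.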

RawPolicyExplorer

class nnabla_rl.environment_explorers.RawPolicyExplorer(policy_action_selector: Callable[[numpy.ndarray], Tuple[numpy.ndarray, Dict]], env_info: nnabla_rl.environments.environment_info.EnvironmentInfo, config: nnabla_rl.environment_explorers.raw_policy_explorer.RawPolicyExplorerConfig = RawPolicyExplorerConfig(warmup_random_steps=0, reward_scalar=1.0, timelimit_as_terminal=True, initial_step_num=0))[source]

Raw policy explorer

Explore using policy’s action without any changes.

Parameters
  • policy_action_selector (Callable[[np.ndarray], Tuple[np.ndarray, Dict]]) – callable which computes current policy’s action with respect to current state.

  • env_info (EnvironmentInfo) – environment info

  • config (RawPolicyExplorerConfig) – the config of this class.

action(step, state)[source]

Compute the action for given state at given timestep

Parameters
  • steps (int) – timesteps since the beginning of exploration

  • state (np.ndarray) – current state to compute the action

Returns

action for current state at given timestep

Return type

np.ndarray

Environments

EnvironmentInfo

class nnabla_rl.environments.environment_info.EnvironmentInfo(observation_space, action_space, max_episode_steps, unwrapped_env, reward_function: Optional[Callable[[Any, Any, Dict], int]] = None)[source]

Environment Information class

This class contains the basic information of the target training environment.

property action_dim

The dimension of action assuming that the action is flattened.

property action_high

The upper limit of action space

property action_low

The lower limit of action space

property action_shape

The shape of action space

static from_env(env)[source]

Create env_info from environment

Parameters

env (gym.Env) – the environment

Returns

EnvironmentInfo (EnvironmentInfo)

Example

>>> import gym
>>> from nnabla_rl.environments.environment_info import EnvironmentInfo
>>> env = gym.make("CartPole-v0")
>>> env_info = EnvironmentInfo.from_env(env)
>>> env_info.state_shape
(4,)

is_continuous_action_env()[source]

Check whether the action to execute in the environment is continuous or not

Returns

True if the action to execute in the environment is continuous. Otherwise False.

Note that if the action is gym.spaces.Tuple and all of the elements are continuous, it returns True.

Return type

bool

is_continuous_state_env()[source]

Check whether the state of the environment is continuous or not

Returns

True if the state of the environment is continuous. Otherwise False.

Note that if the state is gym.spaces.Tuple and all of the elements are continuous, it returns True.

Return type

bool

is_discrete_action_env()[source]

Check whether the action to execute in the environment is discrete or not

Returns

True if the action to execute in the environment is discrete. Otherwise False.

Note that if the action is gym.spaces.Tuple and all of the elements are discrete, it returns True.

Return type

bool

is_discrete_state_env()[source]

Check whether the state of the environment is discrete or not

Returns

True if the state of the environment is discrete. Otherwise False.

Note that if the state is gym.spaces.Tuple and all of the elements are discrete, it returns True.

Return type

bool

is_goal_conditioned_env()[source]

Check whether the environment is gym.GoalEnv or not

Returns

True if the environment is gym.GoalEnv. Otherwise False.

Return type

bool

is_tuple_state_env()[source]

Check whether the state of the environment is tuple or not

Returns

True if the state of the environment is tuple. Otherwise False.

Return type

bool

property state_dim

The dimension of state assuming that the state is flattened.

property state_high

The upper limit of observation space

property state_low

The lower limit of observation space

property state_shape

The shape of observation space

Functions

nnabla_rl.functions.sample_gaussian(mean: nnabla._variable.Variable, ln_var: nnabla._variable.Variable, noise_clip: Optional[Tuple[float, float]] = None) nnabla._variable.Variable[source]

Sample value from a gaussian distribution of given mean and variance.

Parameters
  • mean (nn.Variable) – Mean of the gaussian distribution

  • ln_var (nn.Variable) – Logarithm of the variance of the gaussian distribution

  • noise_clip (Optional[Tuple(float, float)]) – Clipping value of the sampled noise.

Returns

Sampled value from gaussian distribution of given mean and variance

Return type

nn.Variable

nnabla_rl.functions.sample_gaussian_multiple(mean: nnabla._variable.Variable, ln_var: nnabla._variable.Variable, num_samples: int, noise_clip: Optional[Tuple[float, float]] = None) nnabla._variable.Variable[source]

Sample multiple values from a gaussian distribution of given mean and variance. The returned variable will have an additional axis in the middle, as follows: (batch_size, num_samples, dimension).

Parameters
  • mean (nn.Variable) – Mean of the gaussian distribution

  • ln_var (nn.Variable) – Logarithm of the variance of the gaussian distribution

  • num_samples (int) – Number of samples to sample

  • noise_clip (Optional[Tuple(float, float)]) – Clipping value of the sampled noise.

Returns

Sampled values from gaussian distribution of given mean and variance

Return type

nn.Variable
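
The output layout (batch_size, num_samples, dimension) and the effect of noise_clip can be illustrated with nested lists. The sketch below is plain Python, not the library function. It assumes that noise_clip bounds the noise term before it is added to the mean, and it takes an explicit rng argument that the real function does not have.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple


def sample_gaussian_multiple(mean: Sequence[Sequence[float]],
                             ln_var: Sequence[Sequence[float]],
                             num_samples: int,
                             rng: random.Random,
                             noise_clip: Optional[Tuple[float, float]] = None
                             ) -> List[List[List[float]]]:
    """mean and ln_var have shape (batch_size, dimension)."""
    samples = []
    for mean_row, ln_var_row in zip(mean, ln_var):
        per_batch = []
        for _ in range(num_samples):
            row = []
            for mu, lv in zip(mean_row, ln_var_row):
                noise = math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
                if noise_clip is not None:
                    # Assumed behaviour: clip the noise, not the final sample.
                    noise = min(max(noise, noise_clip[0]), noise_clip[1])
                row.append(mu + noise)
            per_batch.append(row)
        samples.append(per_batch)
    return samples


mean = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]  # batch_size=2, dimension=3
ln_var = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
samples = sample_gaussian_multiple(mean, ln_var, num_samples=5,
                                   rng=random.Random(0), noise_clip=(-0.1, 0.1))
```

Here samples[b][n] is the n-th draw for batch element b, so the new axis sits between the batch axis and the data axis.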

nnabla_rl.functions.expand_dims(x: nnabla._variable.Variable, axis: int) nnabla._variable.Variable[source]

Add dimension to target axis of given variable

Parameters
  • x (nn.Variable) – Variable to expand the dimension

  • axis (int) – The axis to expand the dimension. Non negative.

Returns

Variable with additional dimension in the target axis

Return type

nn.Variable

nnabla_rl.functions.repeat(x: nnabla._variable.Variable, repeats: int, axis: int) nnabla._variable.Variable[source]

Repeats the value along given axis for repeats times.

Parameters
  • x (nn.Variable) – Variable to repeat the values along given axis

  • repeats (int) – Number of times to repeat

  • axis (int) – The axis along which to repeat the values. Non negative.

Returns

Variable with values repeated along given axis

Return type

nn.Variable

nnabla_rl.functions.sqrt(x: nnabla._variable.Variable)[source]

Compute the square root of given variable

Parameters

x (nn.Variable) – Variable to compute the square root

Returns

Square root of given variable

Return type

nn.Variable

nnabla_rl.functions.std(x: nnabla._variable.Variable, axis: Optional[int] = None, keepdims: bool = False) nnabla._variable.Variable[source]

Compute the standard deviation of given variable along axis.

Parameters
  • x (nn.Variable) – Variable to compute the standard deviation

  • axis (Optional[int]) – Axis to compute the standard deviation. Defaults to None. None will reduce all dimensions.

  • keepdims (bool) – Flag indicating whether the reduced axes are kept as dimensions with 1 element.

Returns

Standard deviation of given variable along axis.

Return type

nn.Variable

nnabla_rl.functions.argmax(x: nnabla._variable.Variable, axis: Optional[int] = None, keepdims: bool = False) nnabla._variable.Variable[source]

Compute the index which given variable has maximum value along the axis.

Parameters
  • x (nn.Variable) – Variable to compute the argmax

  • axis (Optional[int]) – Axis to compare the values. Defaults to None. None will reduce all dimensions.

  • keepdims (bool) – Flag indicating whether the reduced axes are kept as dimensions with 1 element.

Returns

Index at which the variable’s value is maximum along the axis

Return type

nn.Variable

nnabla_rl.functions.quantile_huber_loss(x0: nnabla._variable.Variable, x1: nnabla._variable.Variable, kappa: float, tau: nnabla._variable.Variable) nnabla._variable.Variable[source]

Compute the quantile huber loss. See following papers for details:

Parameters
  • x0 (nn.Variable) – Quantile values

  • x1 (nn.Variable) – Quantile values

  • kappa (float) – Threshold value of huber loss which switches the loss value between squared loss and linear loss

  • tau (nn.Variable) – Quantile targets

Returns

Quantile huber loss

Return type

nn.Variable
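
As a rough per-element illustration of how kappa and tau interact, the sketch below evaluates a quantile Huber loss on scalars. It assumes the form that divides the Huber term by kappa and takes u as the difference between the two quantile inputs. Both the normalization and the sign convention are assumptions, since the referenced definition is not reproduced here, and the function names are hypothetical.

```python
def huber(u: float, kappa: float) -> float:
    """Squared loss for |u| <= kappa, linear loss beyond it."""
    if abs(u) <= kappa:
        return 0.5 * u * u
    return kappa * (abs(u) - 0.5 * kappa)


def quantile_huber(u: float, tau: float, kappa: float = 1.0) -> float:
    """Weight the Huber loss asymmetrically according to the quantile target tau."""
    indicator = 1.0 if u < 0.0 else 0.0
    return abs(tau - indicator) * huber(u, kappa) / kappa


small_positive_error = quantile_huber(0.5, tau=0.25)
large_negative_error = quantile_huber(-2.0, tau=0.25)
```

With tau=0.25, negative errors are weighted by 0.75 and positive errors by 0.25, which is what pushes the estimate toward the 25th percentile.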

nnabla_rl.functions.mean_squared_error(x0: nnabla._variable.Variable, x1: nnabla._variable.Variable) nnabla._variable.Variable[source]

Convenient alias for mean squared error operation

Parameters
  • x0 (nn.Variable) – N-D array

  • x1 (nn.Variable) – N-D array

Returns

Mean squared error between x0 and x1

Return type

nn.Variable

nnabla_rl.functions.minimum_n(variables: Sequence[nnabla._variable.Variable]) nnabla._variable.Variable[source]

Compute the minimum among the list of variables

Parameters

variables (Sequence[nn.Variable]) – Sequence of variables. All the variables must have same shape.

Returns

Minimum value among the list of variables

Return type

nn.Variable

nnabla_rl.functions.gaussian_cross_entropy_method(objective_function: Callable[[nnabla._variable.Variable], nnabla._variable.Variable], init_mean: nnabla._variable.Variable, init_var: nnabla._variable.Variable, pop_size: int = 500, num_elites: int = 10, num_iterations: int = 5, alpha: float = 0.25) Tuple[nnabla._variable.Variable, nnabla._variable.Variable][source]

Optimize the objective function with respect to its input using the cross entropy method with a gaussian distribution.

Examples

>>> import numpy as np
>>> import nnabla as nn
>>> import nnabla.functions as NF
>>> import nnabla_rl.functions as RF
>>> def objective_function(x): return -((x - 3.)**2)
>>> batch_size = 1
>>> variable_size = 1
>>> init_mean = nn.Variable.from_numpy_array(np.zeros((batch_size, variable_size)))
>>> init_var = nn.Variable.from_numpy_array(np.ones((batch_size, variable_size)))
>>> optimal_x, _ = RF.gaussian_cross_entropy_method(objective_function, init_mean, init_var, alpha=0)
>>> optimal_x.forward()
>>> optimal_x.shape
(1, 1)  # (batch_size, variable_size)
>>> optimal_x.d
array([[3.]], dtype=float32)
Parameters
  • objective_function (Callable[[nn.Variable], nn.Variable]) – objective function

  • init_mean (nn.Variable) – initial mean

  • init_var (nn.Variable) – initial variance

  • pop_size (int) – population size, i.e. the number of candidates sampled in each iteration

  • num_elites (int) – number of elites

  • num_iterations (int) – number of iterations

  • alpha (float) – parameter of soft update

Returns

mean of the elite samples and the top elite sample

Return type

Tuple[nn.Variable, nn.Variable]
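
The loop behind the cross-entropy method can be sketched in NumPy as follows. This is a simplified illustration without a batch dimension, not the library's code: the helper name `gaussian_cem_np`, the `seed` argument, and the exact form of the soft update (alpha weights the previous estimate) are assumptions.

```python
import numpy as np


def gaussian_cem_np(objective, init_mean, init_var, pop_size=500, num_elites=10,
                    num_iterations=5, alpha=0.25, seed=0):
    """Maximize `objective` with a Gaussian cross-entropy method (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    mean = np.asarray(init_mean, dtype=np.float64)
    var = np.asarray(init_var, dtype=np.float64)
    for _ in range(num_iterations):
        # Draw a population of candidates from the current Gaussian.
        samples = rng.normal(mean, np.sqrt(var), size=(pop_size,) + mean.shape)
        scores = np.array([objective(sample) for sample in samples])
        # Keep the best candidates; argsort is ascending, so the best is last.
        elites = samples[np.argsort(scores)[-num_elites:]]
        # Assumed soft update: alpha keeps a fraction of the previous estimate.
        mean = alpha * mean + (1.0 - alpha) * elites.mean(axis=0)
        var = alpha * var + (1.0 - alpha) * elites.var(axis=0)
    return mean, elites[-1]


def objective(x):
    # Scalar objective with its maximum at x = 3.
    return -np.sum((x - 3.0) ** 2)


elite_mean, top_elite = gaussian_cem_np(objective, init_mean=np.zeros(1), init_var=np.ones(1))
```

Both returned values should end up close to 3 for this objective; the exact numbers depend on the random seed.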

nnabla_rl.functions.triangular_matrix(diagonal: nnabla._variable.Variable, non_diagonal: Optional[nnabla._variable.Variable] = None, upper=False) nnabla._variable.Variable[source]

Compute triangular_matrix from given diagonal and non_diagonal elements. If non_diagonal is None, will create a diagonal matrix.

Example

>>> import numpy as np
>>> import nnabla as nn
>>> import nnabla.functions as NF
>>> import nnabla_rl.functions as RF
>>> diag_size = 3
>>> batch_size = 2
>>> non_diag_size = diag_size * (diag_size - 1) // 2
>>> diagonal = nn.Variable.from_numpy_array(np.ones(6).astype(np.float32).reshape((batch_size, diag_size)))
>>> non_diagonal = nn.Variable.from_numpy_array(np.arange(batch_size*non_diag_size).astype(np.float32).reshape((batch_size, non_diag_size)))
>>> diagonal.d
array([[1., 1., 1.],
       [1., 1., 1.]], dtype=float32)
>>> non_diagonal.d
array([[0., 1., 2.],
       [3., 4., 5.]], dtype=float32)
>>> lower_triangular_matrix = RF.triangular_matrix(diagonal, non_diagonal)
>>> lower_triangular_matrix.forward()
>>> lower_triangular_matrix.d
array([[[1., 0., 0.],
        [0., 1., 0.],
        [1., 2., 1.]],
       [[1., 0., 0.],
        [3., 1., 0.],
        [4., 5., 1.]]], dtype=float32)
Parameters
  • diagonal (nn.Variable) – diagonal elements of lower triangular matrix. Its shape must be (batch_size, diagonal_size).

  • non_diagonal (nn.Variable or None) – non-diagonal part of lower triangular elements. Its shape must be (batch_size, diagonal_size * (diagonal_size - 1) // 2).

  • upper (bool) – If true will create an upper triangular matrix. Otherwise will create a lower triangular matrix.

Returns

triangular matrix constructed from given variables (lower triangular unless upper is True).

Return type

nn.Variable
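
The element layout shown in the example above (non-diagonal entries filled row by row below the diagonal) can be reproduced with NumPy. This sketch covers only the lower triangular case; the helper name `lower_triangular_np` is hypothetical and the row-major fill order is inferred from the example output.

```python
import numpy as np


def lower_triangular_np(diagonal, non_diagonal=None):
    """Build batched lower triangular matrices (hypothetical helper)."""
    diagonal = np.asarray(diagonal, dtype=np.float32)
    batch_size, diag_size = diagonal.shape
    matrix = np.zeros((batch_size, diag_size, diag_size), dtype=np.float32)
    idx = np.arange(diag_size)
    matrix[:, idx, idx] = diagonal
    if non_diagonal is not None:
        # Strictly-lower indices in row-major order: (1,0), (2,0), (2,1), ...
        rows, cols = np.tril_indices(diag_size, k=-1)
        matrix[:, rows, cols] = np.asarray(non_diagonal, dtype=np.float32)
    return matrix


diagonal = np.ones((2, 3), dtype=np.float32)
non_diagonal = np.arange(6, dtype=np.float32).reshape((2, 3))
lower = lower_triangular_np(diagonal, non_diagonal)
diag_only = lower_triangular_np(diagonal)
```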

nnabla_rl.functions.batch_flatten(x: nnabla._variable.Variable) nnabla._variable.Variable[source]

Collapse the variable shape into (batch_size, rest).

Example

>>> import numpy as np
>>> import nnabla as nn
>>> import nnabla_rl.functions as RF
>>> variable_shape = (3, 4, 5, 6)
>>> x = nn.Variable.from_numpy_array(np.random.normal(size=variable_shape))
>>> x.shape
(3, 4, 5, 6)
>>> flattened_x = RF.batch_flatten(x)
>>> flattened_x.shape
(3, 120)
Parameters

x (nn.Variable) – N-D array

Returns

Flattened variable.

Return type

nn.Variable

Hooks

Hook is a utility tool for training. All hooks are derived from nnabla_rl.hook.Hook

Hook

class nnabla_rl.hook.Hook(timing=1000)[source]

Base class of hooks for Algorithm classes.

Hook is called at every ‘timing’ iterations during the training. ‘timing’ is specified at the beginning of the class instantiation.

abstract on_hook_called(algorithm)[source]

Called every “timing” iteration which is set on Hook’s instance creation. Will run additional periodical operation (see each class’ documentation) during the training.

Parameters

algorithm (nnabla_rl.algorithm.Algorithm) – Algorithm instance to perform additional operation.
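
The "called every timing iterations" contract can be illustrated with a small stand-in that does not import the library. Everything here is hypothetical (`SketchHook`, `RecordingHook`, `FakeAlgorithm`), and the dispatch rule based on `iteration_num % timing` is an assumption about how such a base class might behave, not the library's code.

```python
from abc import ABC, abstractmethod


class SketchHook(ABC):
    """Hypothetical stand-in for a hook base class."""

    def __init__(self, timing=1000):
        self._timing = timing

    def __call__(self, algorithm):
        # Assumed dispatch rule: fire only on multiples of `timing`.
        if algorithm.iteration_num % self._timing == 0:
            self.on_hook_called(algorithm)

    @abstractmethod
    def on_hook_called(self, algorithm):
        raise NotImplementedError


class RecordingHook(SketchHook):
    """Records the iteration numbers at which it was triggered."""

    def __init__(self, timing):
        super().__init__(timing)
        self.called_at = []

    def on_hook_called(self, algorithm):
        self.called_at.append(algorithm.iteration_num)


class FakeAlgorithm:
    """Minimal object exposing only `iteration_num`."""
    iteration_num = 0


algorithm = FakeAlgorithm()
hook = RecordingHook(timing=3)
for iteration in range(1, 11):
    algorithm.iteration_num = iteration
    hook(algorithm)
```

A custom hook only needs to implement on_hook_called; the periodic scheduling stays in the base class.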

List of Hooks

class nnabla_rl.hooks.EvaluationHook(env, evaluator=<nnabla_rl.utils.evaluator.EpisodicEvaluator object>, timing=1000, writer=None)[source]

Bases: nnabla_rl.hook.Hook

Hook to run evaluation during training.

Parameters
  • env (gym.Env) – Environment to run the evaluation

  • evaluator (Callable[[nnabla_rl.algorithm.Algorithm, gym.Env], List[float]]) – Evaluator which runs the actual evaluation. Defaults to EpisodicEvaluator.

  • timing (int) – Evaluation interval. Defaults to 1000 iteration.

  • writer (nnabla_rl.writer.Writer, optional) – Writer instance to save/print the evaluation results. Defaults to None.

on_hook_called(algorithm)[source]

Called every “timing” iteration which is set on Hook’s instance creation. Will run additional periodical operation (see each class’ documentation) during the training.

Parameters

algorithm (nnabla_rl.algorithm.Algorithm) – Algorithm instance to perform additional operation.

class nnabla_rl.hooks.IterationNumHook(timing=1)[source]

Bases: nnabla_rl.hook.Hook

Hook to print the iteration number periodically. This hook just prints the iteration number of training.

Parameters

timing (int) – Printing interval. Defaults to 1 iteration.

on_hook_called(algorithm)[source]

Called every “timing” iteration which is set on Hook’s instance creation. Will run additional periodical operation (see each class’ documentation) during the training.

Parameters

algorithm (nnabla_rl.algorithm.Algorithm) – Algorithm instance to perform additional operation.

class nnabla_rl.hooks.IterationStateHook(writer=None, timing=1000)[source]

Bases: nnabla_rl.hook.Hook

Hook which retrieves the iteration state to print/save the training status through writer.

Parameters
  • timing (int) – Retrieving interval. Defaults to 1000 iteration.

  • writer (nnabla_rl.writer.Writer, optional) – Writer instance to save/print the iteration states. Defaults to None.

on_hook_called(algorithm)[source]

Called every “timing” iteration which is set on Hook’s instance creation. Will run additional periodical operation (see each class’ documentation) during the training.

Parameters

algorithm (nnabla_rl.algorithm.Algorithm) – Algorithm instance to perform additional operation.

class nnabla_rl.hooks.SaveSnapshotHook(outdir, timing=1000)[source]

Bases: nnabla_rl.hook.Hook

Hook to save the training snapshot of current algorithm.

Parameters

timing (int) – Saving interval. Defaults to 1000 iteration.

on_hook_called(algorithm)[source]

Called every “timing” iteration which is set on Hook’s instance creation. Will run additional periodical operation (see each class’ documentation) during the training.

Parameters

algorithm (nnabla_rl.algorithm.Algorithm) – Algorithm instance to perform additional operation.

class nnabla_rl.hooks.TimeMeasuringHook(timing=1)[source]

Bases: nnabla_rl.hook.Hook

Hook to measure and print the actual time spent to run the iteration(s).

Parameters

timing (int) – Measuring interval. Defaults to 1 iteration.

on_hook_called(algorithm)[source]

Called every “timing” iteration which is set on Hook’s instance creation. Will run additional periodical operation (see each class’ documentation) during the training.

Parameters

algorithm (nnabla_rl.algorithm.Algorithm) – Algorithm instance to perform additional operation.

Models

All models are derived from nnabla_rl.models.Model

Model

class nnabla_rl.models.model.Model(scope_name: str)[source]

Model Class

Parameters

scope_name (str) – the scope name of model

deepcopy(new_scope_name: str) nnabla_rl.models.model.Model[source]

Create a copy of the model. All parameters associated with the model (if any) will be copied.

Parameters

new_scope_name (str) – scope_name of parameters for newly created model

Returns

copied model

Return type

Model

Raises

ValueError – Given scope name is same as the model or already exists.

get_parameters(grad_only: bool = True) Dict[str, nnabla._variable.Variable][source]

Retrieve parameters associated with this model

Parameters

grad_only (bool) – Retrieve only parameters with need_grad = True. Defaults to True.

Returns

Parameter map.

Return type

parameters (OrderedDict)

load_parameters(filepath: Union[str, pathlib.Path]) None[source]

Load model parameters from given filepath.

Parameters

filepath (str or pathlib.Path) – parameter file path

save_parameters(filepath: Union[str, pathlib.Path]) None[source]

Save model parameters to given filepath.

Parameters

filepath (str or pathlib.Path) – parameter file path

property scope_name: str

Get scope name of this model.

Returns

scope name of the model

Return type

scope_name (str)

List of Models

class nnabla_rl.models.Perturbator(scope_name)[source]

Bases: nnabla_rl.models.model.Model

Abstract class for perturbator

Perturbator generates noise to append to current state’s action

class nnabla_rl.models.Policy(scope_name: str)[source]

Bases: nnabla_rl.models.model.Model

class nnabla_rl.models.DeterministicPolicy(scope_name: str)[source]

Bases: nnabla_rl.models.policy.Policy

Abstract class for deterministic policy

This policy returns an action for the given state.

abstract pi(s: nnabla._variable.Variable) nnabla._variable.Variable[source]
Parameters

s (nnabla.Variable) – State variable

Returns

Action for the given state

Return type

nnabla.Variable

class nnabla_rl.models.StochasticPolicy(scope_name: str)[source]

Bases: nnabla_rl.models.policy.Policy

Abstract class for stochastic policy

This policy returns a probability distribution of action for the given state.

abstract pi(s: nnabla._variable.Variable) nnabla_rl.distributions.distribution.Distribution[source]
Parameters

s (nnabla.Variable) – State variable

Returns

Probability distribution of the action for the given state

Return type

nnabla_rl.distributions.Distribution

class nnabla_rl.models.QFunction(scope_name: str)[source]

Bases: nnabla_rl.models.model.Model

Base QFunction Class

all_q(s: nnabla._variable.Variable) nnabla._variable.Variable[source]

Compute Q-values for each action for given state

Parameters

s (nn.Variable) – state variable

Returns

Q-values for each action for given state

Return type

nn.Variable

argmax_q(s: nnabla._variable.Variable) nnabla._variable.Variable[source]

Compute the action which maximizes the Q-value for given state

Parameters

s (nn.Variable) – state variable

Returns

action which maximizes the Q-value for given state

Return type

nn.Variable

max_q(s: nnabla._variable.Variable) nnabla._variable.Variable[source]

Compute maximum Q-value for given state

Parameters

s (nn.Variable) – state variable

Returns

maximum Q-value for given state

Return type

nn.Variable

abstract q(s: nnabla._variable.Variable, a: nnabla._variable.Variable) nnabla._variable.Variable[source]

Compute Q-value for given state and action

Parameters
  • s (nn.Variable) – state variable

  • a (nn.Variable) – action variable

Returns

Q-value for given state and action

Return type

nn.Variable
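
How the four methods relate to each other can be sketched with a tabular Q-function in NumPy. This is an illustration only: `TabularQSketch` is hypothetical, states are integer indices of shape (batch_size, 1), and the output shapes are assumptions rather than guarantees of the library.

```python
import numpy as np


class TabularQSketch:
    """Hypothetical tabular Q-function over integer states."""

    def __init__(self, table):
        self._table = np.asarray(table, dtype=np.float32)  # (n_state, n_action)

    def all_q(self, s):
        # Q-values of every action, shape (batch_size, n_action).
        return self._table[s[:, 0]]

    def q(self, s, a):
        # Pick the Q-value of the given action, shape (batch_size, 1).
        return np.take_along_axis(self.all_q(s), a, axis=1)

    def max_q(self, s):
        return self.all_q(s).max(axis=1, keepdims=True)

    def argmax_q(self, s):
        return self.all_q(s).argmax(axis=1)[:, None]


q_function = TabularQSketch([[1.0, 5.0, 2.0], [7.0, 0.0, 3.0]])
s = np.array([[0], [1]])
a = np.array([[2], [0]])
q_sa = q_function.q(s, a)
greedy_action = q_function.argmax_q(s)
greedy_value = q_function.max_q(s)
```

In this sketch only q needs the action as input; max_q and argmax_q are derived from all_q.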

class nnabla_rl.models.ValueDistributionFunction(scope_name: str, n_action: int, n_atom: int, v_min: float, v_max: float)[source]

Bases: nnabla_rl.models.model.Model

Base value distribution class.

Computes the probabilities of q-value for each action. Value distribution function models the probabilities of q value for each action by dividing the values between the maximum q value and minimum q value into ‘n_atom’ number of bins and assigning the probability to each bin.

Parameters
  • scope_name (str) – scope name of the model

  • n_action (int) – Number of actions used in the target environment.

  • n_atom (int) – Number of bins.

  • v_min (float) – Minimum value of the distribution.

  • v_max (float) – Maximum value of the distribution.

all_probs(s: nnabla._variable.Variable) nnabla._variable.Variable[source]

Compute probabilities of atoms for all possible actions for given state

Parameters

s (nn.Variable) – state variable

Returns

probabilities of atoms for all possible actions for given state

Return type

nn.Variable

as_q_function() nnabla_rl.models.q_function.QFunction[source]

Convert the value distribution function to QFunction.

Returns

QFunction instance which computes the q-values based on the probabilities.

Return type

nnabla_rl.models.q_function.QFunction

max_q_probs(s: nnabla._variable.Variable) nnabla._variable.Variable[source]

Compute probabilities of atoms for given state that maximizes the q_value

Parameters

s (nn.Variable) – state variable

Returns

probabilities of atoms for given state that maximizes the q_value

Return type

nn.Variable

abstract probs(s: nnabla._variable.Variable, a: nnabla._variable.Variable) nnabla._variable.Variable[source]

Compute probabilities of atoms for given state and action

Parameters
  • s (nn.Variable) – state variable

  • a (nn.Variable) – action variable

Returns

probabilities of atoms for given state and action

Return type

nn.Variable
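
The relation between atom probabilities and Q-values can be sketched in NumPy. This assumes the common categorical convention in which the n_atom bins are represented by evenly spaced support points between v_min and v_max; the array layout (batch_size, n_action, n_atom) is also an assumption.

```python
import numpy as np

n_atom, v_min, v_max = 5, -10.0, 10.0
# Assumed support: evenly spaced atoms [-10, -5, 0, 5, 10].
atoms = np.linspace(v_min, v_max, n_atom)

# Probabilities for one state and two actions, shape (1, 2, n_atom).
probs = np.array([[[0.2, 0.2, 0.2, 0.2, 0.2],
                   [0.0, 0.0, 0.0, 0.0, 1.0]]])

# Expected value of each action's distribution, as a Q-function would report it.
q_values = (probs * atoms).sum(axis=2)
greedy_action = q_values.argmax(axis=1)
# Atom probabilities of the action that maximizes the Q-value.
max_q_probs = probs[np.arange(probs.shape[0]), greedy_action]
```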

class nnabla_rl.models.QuantileDistributionFunction(scope_name: str, n_action: int, n_quantile: int)[source]

Bases: nnabla_rl.models.model.Model

Base quantile distribution class.

Computes the quantiles of q-value for each action. Quantile distribution function models the quantiles of q value for each action by dividing the probability (which is between 0.0 and 1.0) into ‘n_quantile’ number of bins and assigning the n-quantile to n-th bin.

Parameters
  • scope_name (str) – scope name of the model

  • n_action (int) – Number of actions used in the target environment.

  • n_quantile (int) – Number of bins.

all_quantiles(s: nnabla._variable.Variable) nnabla._variable.Variable[source]

Computes the quantiles of q-value for each action for the given state.

Parameters

s (nn.Variable) – state variable

Returns

quantiles of q-value for each action for the given state

Return type

nn.Variable

as_q_function() nnabla_rl.models.q_function.QFunction[source]

Convert the quantile distribution function to QFunction.

Returns

QFunction instance which computes the q-values based on the quantiles.

Return type

nnabla_rl.models.q_function.QFunction

max_q_quantiles(s: nnabla._variable.Variable) nnabla._variable.Variable[source]

Compute the quantiles of q-value for given state that maximizes the q_value

Parameters

s (nn.Variable) – state variable

Returns

quantiles of q-value for given state that maximizes the q_value

Return type

nn.Variable

quantiles(s: nnabla._variable.Variable, a: nnabla._variable.Variable) nnabla._variable.Variable[source]

Computes the quantiles of q-value for given state and action.

Parameters
  • s (nn.Variable) – state variable

  • a (nn.Variable) – action variable

Returns

quantiles of q-value for given state and action.

Return type

nn.Variable
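
A quantile representation can be turned into Q-values by averaging the quantiles, as the following NumPy sketch shows. The midpoint formula for the quantile levels and the array layout (batch_size, n_action, n_quantile) follow the usual quantile-regression convention and are assumptions here, not taken from the library.

```python
import numpy as np

n_quantile = 4
# Assumed quantile levels: midpoints of n_quantile equal-probability bins.
tau_hat = (2 * np.arange(n_quantile) + 1) / (2.0 * n_quantile)

# Quantiles of the return for one state and two actions, shape (1, 2, n_quantile).
quantiles = np.array([[[0.0, 1.0, 2.0, 3.0],
                       [2.0, 2.0, 2.0, 2.0]]])

# Each quantile carries probability 1 / n_quantile, so the Q-value is the mean.
q_values = quantiles.mean(axis=2)
greedy_action = q_values.argmax(axis=1)
# Quantiles of the action that maximizes the Q-value.
max_q_quantiles = quantiles[np.arange(quantiles.shape[0]), greedy_action]
```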

class nnabla_rl.models.StateActionQuantileFunction(scope_name: str, n_action: int, K: int, risk_measure_function: Callable[[nnabla._variable.Variable], nnabla._variable.Variable] = <function risk_neutral_measure>)[source]

Bases: nnabla_rl.models.model.Model

state-action quantile function class.

Computes the return samples of q-value for each action. State-action quantile function computes the return samples of q value for each action using sampled quantile threshold (e.g. \(\tau\sim U([0,1])\)) for given state.

Parameters
  • scope_name (str) – scope name of the model

  • n_action (int) – Number of actions used in the target environment.

  • K (int) – Number of samples for quantile threshold \(\tau\).

  • risk_measure_function (Callable[[nn.Variable], nn.Variable]) – Risk measure function which modifies the weightings of tau. Defaults to risk neutral measure which does not do any change to the taus.

all_quantile_values(s: nnabla._variable.Variable, tau: nnabla._variable.Variable) nnabla._variable.Variable[source]

Compute the return samples for all action for given state and quantile threshold.

Parameters
  • s (nn.Variable) – state variable.

  • tau (nn.Variable) – quantile threshold.

Returns

return samples from implicit return distribution for given state using tau.

Return type

nn.Variable

as_q_function() nnabla_rl.models.q_function.QFunction[source]

Convert the state action quantile function to QFunction.

Returns

QFunction instance which computes the q-values based on return samples.

Return type

nnabla_rl.models.q_function.QFunction

max_q_quantile_values(s: nnabla._variable.Variable, tau: nnabla._variable.Variable) nnabla._variable.Variable[source]

Compute the return samples from distribution that maximizes q value for given state using quantile threshold.

Parameters
  • s (nn.Variable) – state variable.

  • tau (nn.Variable) – quantile threshold.

Returns

return samples from implicit return distribution that maximizes q for given state using tau.

Return type

nn.Variable

quantile_values(s: nnabla._variable.Variable, a: nnabla._variable.Variable, tau: nnabla._variable.Variable) nnabla._variable.Variable[source]

Compute the return samples for given state and action.

Parameters
  • s (nn.Variable) – state variable.

  • a (nn.Variable) – action variable.

  • tau (nn.Variable) – quantile threshold.

Returns

return samples from implicit return distribution for given state and action using tau.

Return type

nn.Variable

sample_tau(shape: Optional[Iterable] = None) nnabla._variable.Variable[source]

Sample quantile thresholds from uniform distribution

Parameters

shape (Tuple[int] or None) – shape of the quantile threshold to sample. If None the shape will be (1, K).

Returns

quantile thresholds

Return type

nn.Variable
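
The idea of sampling quantile thresholds and averaging the resulting return samples can be sketched with a toy model whose quantile function is known in closed form. All names here (`sample_tau_np`, `all_quantile_values_np`) and the output layout (batch_size, K, n_action) are hypothetical; the toy return distributions are uniform so that the expected values are easy to check.

```python
import numpy as np


def sample_tau_np(rng, K, shape=None):
    """Sample quantile thresholds from U([0, 1]); default shape is (1, K)."""
    return rng.uniform(0.0, 1.0, size=(1, K) if shape is None else shape)


def all_quantile_values_np(tau):
    """Toy inverse CDFs: action 0 returns ~ U(0, 2), action 1 returns ~ U(1, 2)."""
    return np.stack([2.0 * tau, 1.0 + tau], axis=-1)  # (batch_size, K, n_action)


rng = np.random.default_rng(0)
tau = sample_tau_np(rng, K=1000)
return_samples = all_quantile_values_np(tau)
# Risk-neutral Q estimate: average the return samples over the K thresholds.
q_values = return_samples.mean(axis=1)
```

A risk-sensitive variant would transform tau before evaluating the quantile function, which is the role the risk measure function plays in the class above.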

class nnabla_rl.models.reward_function.RewardFunction(scope_name: str)[source]

Bases: nnabla_rl.models.model.Model

Base reward function class

abstract r(s_current: nnabla._variable.Variable, a_current: nnabla._variable.Variable, s_next: nnabla._variable.Variable) nnabla._variable.Variable[source]

Computes the reward for the given state, action and next state. One (or more than one) of the input variables may not be used in the actual computation.

Parameters
  • s_current (nnabla.Variable) – State variable

  • a_current (nnabla.Variable) – Action variable

  • s_next (nnabla.Variable) – Next state variable

Returns

Reward for the given state, action and next state.

Return type

nnabla.Variable

class nnabla_rl.models.VFunction(scope_name: str)[source]

Bases: nnabla_rl.models.model.Model

Base Value function class

abstract v(s: nnabla._variable.Variable) nnabla._variable.Variable[source]

Compute the state value (V) for given state

Parameters

s (nn.Variable) – state variable

Returns

State value for given state

Return type

nn.Variable

class nnabla_rl.models.Encoder(scope_name: str)[source]

Bases: nnabla_rl.models.model.Model

abstract encode(x: nnabla._variable.Variable, **kwargs) nnabla._variable.Variable[source]

Encode the input variable to latent representation.

Parameters

x (nn.Variable) – encoder input.

Returns

latent variable

Return type

nn.Variable

class nnabla_rl.models.VariationalAutoEncoder(scope_name: str)[source]

Bases: nnabla_rl.models.encoder.Encoder

abstract decode(z: Optional[nnabla._variable.Variable], **kwargs) nnabla._variable.Variable[source]

Reconstruct the latent representation.

Parameters

z (nn.Variable, optional) – latent variable. If the input is None, random sample will be used instead.

Returns

reconstructed variable

Return type

nn.Variable

abstract decode_multiple(z: Optional[nnabla._variable.Variable], decode_num: int, **kwargs)[source]

Reconstruct multiple latent representations.

Parameters

z (nn.Variable, optional) – encoder input. If the input is None, random sample will be used instead.

Returns

Reconstructed input and latent distribution

Return type

nn.Variable

abstract encode_and_decode(x: nnabla._variable.Variable, **kwargs) Tuple[nnabla_rl.distributions.distribution.Distribution, nnabla._variable.Variable][source]

Encode the input variable and reconstruct.

Parameters

x (nn.Variable) – encoder input.

Returns

latent distribution and reconstructed input

Return type

Tuple[Distribution, nn.Variable]

abstract latent_distribution(x: nnabla._variable.Variable, **kwargs) nnabla_rl.distributions.distribution.Distribution[source]

Compute the latent distribution \(p(z|x)\).

Parameters

x (nn.Variable) – encoder input.

Returns

latent distribution

Return type

Distribution

Parametric functions

nnabla_rl.parametric_functions.noisy_net(inp: nnabla._variable.Variable, n_outmap: int, base_axis: int = 1, w_init: Optional[Callable[[Tuple[int, ...]], numpy.ndarray]] = None, b_init: Optional[Callable[[Tuple[int, ...]], numpy.ndarray]] = None, noisy_w_init: Optional[Callable[[Tuple[int, ...]], numpy.ndarray]] = None, noisy_b_init: Optional[Callable[[Tuple[int, ...]], numpy.ndarray]] = None, fix_parameters: bool = False, rng: Optional[numpy.random.mtrand.RandomState] = None, with_bias: bool = True, with_noisy_bias: bool = True, apply_w: Optional[Callable[[nnabla._variable.Variable], nnabla._variable.Variable]] = None, apply_b: Optional[Callable[[nnabla._variable.Variable], nnabla._variable.Variable]] = None, apply_noisy_w: Optional[Callable[[nnabla._variable.Variable], nnabla._variable.Variable]] = None, apply_noisy_b: Optional[Callable[[nnabla._variable.Variable], nnabla._variable.Variable]] = None, seed: int = - 1) nnabla._variable.Variable[source]

Noisy linear layer with factorized gaussian noise proposed by Fortunato et al. in the paper “Noisy networks for exploration”. See: https://arxiv.org/abs/1706.10295 for details.

Parameters
  • inp (nn.Variable) – Input of the layer.

  • n_outmap (int) – Output dimension of the layer.

  • base_axis (int) – Axis of the input to treat as sample dimensions. Dimensions up to base_axis will be treated as sample dimensions. Defaults to 1.

  • w_init (None or Callable[[Tuple[int, ...]], np.ndarray]) – Initializer of weights used in deterministic stream. Defaults to None. If None, will be initialized with Uniform distribution \((-\frac{1}{\sqrt{fanin}},\frac{1}{\sqrt{fanin}})\).

  • b_init (None or Callable[[Tuple[int, ...]], np.ndarray]) – Initializer of bias used in deterministic stream. Defaults to None. If None, will be initialized with Uniform distribution \((-\frac{1}{\sqrt{fanin}},\frac{1}{\sqrt{fanin}})\).

  • noisy_w_init (None or Callable[[Tuple[int, ...]], np.ndarray]) – Initializer of weights used in noisy stream. Defaults to None. If None, will be initialized to a constant value of \(\frac{0.5}{\sqrt{fanin}}\).

  • noisy_b_init (None or Callable[[Tuple[int, ...]], np.ndarray]) – Initializer of bias used in noisy stream. Defaults to None. If None, will be initialized to a constant value of \(\frac{0.5}{\sqrt{fanin}}\).

  • fix_parameters (bool) – If True, underlying weight and bias parameters will not be updated during training. Defaults to False.

  • rng (None or np.random.RandomState) – Random number generator for parameter initializer. Defaults to None.

  • with_bias (bool) – If True, deterministic bias term is included in the computation. Defaults to True.

  • with_noisy_bias (bool) – If True, noisy bias term is included in the computation. Defaults to True.

  • apply_w (None or Callable[[nn.Variable], nn.Variable]) – Callable object to apply to the weights on initialization. Defaults to None.

  • apply_b (None or Callable[[nn.Variable], nn.Variable]) – Callable object to apply to the bias on initialization. Defaults to None.

  • apply_noisy_w (None or Callable[[nn.Variable], nn.Variable]) – Callable object to apply to the noisy weight on initialization. Defaults to None.

  • apply_noisy_b (None or Callable[[nn.Variable], nn.Variable]) – Callable object to apply to the noisy bias on initialization. Defaults to None.

  • seed (int) – Random seed. If -1, seed will be sampled from global random number generator. Defaults to -1.

Returns

Linearly transformed input with noisy weights

Return type

nn.Variable
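
The factorized Gaussian noise described in the paper can be sketched in NumPy. This is a minimal illustration, not the library's implementation: the helper names, the (fan_in, n_outmap) weight layout, and passing the parameters explicitly are assumptions, while the initial values follow the defaults documented above.

```python
import numpy as np


def scale_noise(x):
    """Noise shaping function f(x) = sign(x) * sqrt(|x|) from the paper."""
    return np.sign(x) * np.sqrt(np.abs(x))


def noisy_linear_np(x, w_mu, b_mu, w_sigma, b_sigma, rng):
    """Linear layer with factorized Gaussian noise (hypothetical helper)."""
    fan_in, n_outmap = w_mu.shape
    eps_in = scale_noise(rng.standard_normal(fan_in))
    eps_out = scale_noise(rng.standard_normal(n_outmap))
    # Factorized noise: one vector per input and output unit instead of a full matrix.
    w = w_mu + w_sigma * np.outer(eps_in, eps_out)
    b = b_mu + b_sigma * eps_out
    return x @ w + b


rng = np.random.default_rng(0)
fan_in, n_outmap = 4, 3
bound = 1.0 / np.sqrt(fan_in)
w_mu = rng.uniform(-bound, bound, size=(fan_in, n_outmap))
b_mu = rng.uniform(-bound, bound, size=n_outmap)
w_sigma = np.full((fan_in, n_outmap), 0.5 / np.sqrt(fan_in))
b_sigma = np.full(n_outmap, 0.5 / np.sqrt(fan_in))

x = rng.standard_normal((2, fan_in))
y_noisy = noisy_linear_np(x, w_mu, b_mu, w_sigma, b_sigma, rng)
# With zero sigma the layer reduces to an ordinary linear layer.
y_plain = noisy_linear_np(x, w_mu, b_mu, np.zeros_like(w_sigma), np.zeros_like(b_sigma), rng)
```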

nnabla_rl.parametric_functions.spatial_softmax(inp: nnabla._variable.Variable, alpha_init: float = 1.0, fix_alpha: bool = False) nnabla._variable.Variable[source]

Spatial softmax layer proposed in https://arxiv.org/abs/1509.06113. Computes

\[ \begin{align}\begin{aligned}s_{cij} &= \frac{\exp(x_{cij} / \alpha)}{\sum_{i'j'} \exp(x_{ci'j'} / \alpha)}\\f_{cx} &= \sum_{ij} s_{cij}px_{ij}, f_{cy} = \sum_{ij} s_{cij}py_{ij}\\y_{c} &= (f_{cx}, f_{cy})\end{aligned}\end{align} \]

where \(x, y, \alpha\) are the input, output and parameter respectively, and \(c, i, j\) are the channel, height and width indices respectively. \((px_{ij}, py_{ij})\) is the image-space position of the point (i, j) in the response map.

Parameters
  • inp (nn.Variable) – Input of the layer. Shape should be (batch_size, C, H, W)

  • alpha_init (float) – Initial temperature value. Defaults to 1.

  • fix_alpha (bool) – If True, underlying alpha will not be updated during training. Defaults to False.

Returns

Feature points, Shape is (batch_size, C*2)

Return type

nn.Variable
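
The computation in the formula above can be sketched in NumPy. Two details are assumptions of this sketch rather than documented behavior: image-space positions are normalized to [-1, 1] along each axis, and the output is laid out as (f_x, f_y) pairs per channel. The helper name `spatial_softmax_np` is hypothetical and alpha is treated as a fixed constant.

```python
import numpy as np


def spatial_softmax_np(x, alpha=1.0):
    """Spatial softmax over (batch_size, C, H, W) inputs (hypothetical helper)."""
    batch_size, channels, height, width = x.shape
    logits = x.reshape(batch_size, channels, height * width) / alpha
    # Subtract the per-channel maximum for numerical stability.
    logits = logits - logits.max(axis=2, keepdims=True)
    s = np.exp(logits)
    s /= s.sum(axis=2, keepdims=True)
    # Assumed image-space positions, normalized to [-1, 1].
    pos_y, pos_x = np.meshgrid(np.linspace(-1.0, 1.0, height),
                               np.linspace(-1.0, 1.0, width), indexing="ij")
    f_x = (s * pos_x.reshape(-1)).sum(axis=2)
    f_y = (s * pos_y.reshape(-1)).sum(axis=2)
    return np.stack([f_x, f_y], axis=2).reshape(batch_size, channels * 2)


x = np.zeros((1, 2, 3, 3))
x[0, 0, 0, 2] = 50.0  # channel 0: strong response in the top-right corner
x[0, 1, 1, 1] = 50.0  # channel 1: strong response at the center
features = spatial_softmax_np(x)
```

Each channel is reduced to the expected position of its response, so a sharp peak yields a feature point at the peak's location.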

ReplayBuffers

All replay buffers are derived from nnabla_rl.replay_buffer.ReplayBuffer

ReplayBuffer

class nnabla_rl.replay_buffer.ReplayBuffer(capacity: Optional[int] = None)[source]
append(experience: Tuple[Union[numpy.ndarray, Tuple[numpy.ndarray, ...]], numpy.ndarray, float, float, Union[numpy.ndarray, Tuple[numpy.ndarray, ...]], Dict[str, Any]])[source]

Add new experience to the replay buffer.

Parameters

experience (array-like) – Experience includes transitions, such as state, action, reward, and whether the environment episode has finished or not. For more information, see the [Replay buffer documents](replay_buffer.md)

Notes

If the replay buffer is full, the oldest experience (head of the buffer) will be dropped and the given experience will be added to the tail of the buffer.

append_all(experiences: Sequence[Tuple[Union[numpy.ndarray, Tuple[numpy.ndarray, ...]], numpy.ndarray, float, float, Union[numpy.ndarray, Tuple[numpy.ndarray, ...]], Dict[str, Any]]])[source]

Add list of experiences to the replay buffer.

Parameters

experiences (Sequence[Experience]) – Sequence of experiences to insert to the buffer

Notes

If the replay buffer is full, the oldest experience (head of the buffer) will be dropped and the given experience will be added to the tail of the buffer.

property capacity: Optional[int]

Capacity (max length) of this replay buffer, or None if no capacity is set

sample(num_samples: int = 1, num_steps: int = 1) Tuple[Union[Sequence[Tuple[Union[numpy.ndarray, Tuple[numpy.ndarray, ...]], numpy.ndarray, float, float, Union[numpy.ndarray, Tuple[numpy.ndarray, ...]], Dict[str, Any]]], Tuple[Sequence[Tuple[Union[numpy.ndarray, Tuple[numpy.ndarray, ...]], numpy.ndarray, float, float, Union[numpy.ndarray, Tuple[numpy.ndarray, ...]], Dict[str, Any]]], ...]], Dict[str, Any]][source]

Randomly sample num_samples experiences from the replay buffer.

Parameters
  • num_samples (int) – Number of samples to sample from the replay buffer. Defaults to 1.

  • num_steps (int) – Number of timesteps to sample. Should be greater than 0. Defaults to 1.

Returns

Random num_samples of experiences. If num_steps is greater than 1, will return a tuple of size num_steps which contains num_samples of experiences for each entry.

info (Dict[str, Any]): dictionary of information about experiences.

Return type

experiences (Sequence[Experience] or Tuple[Sequence[Experience], …])

Raises

ValueError – num_samples exceeds the maximum possible index or num_steps is 0 or negative.

Notes

Sampling strategy depends on the underlying implementation.

sample_indices(indices: Sequence[int], num_steps: int = 1) Tuple[Union[Sequence[Tuple[Union[numpy.ndarray, Tuple[numpy.ndarray, ...]], numpy.ndarray, float, float, Union[numpy.ndarray, Tuple[numpy.ndarray, ...]], Dict[str, Any]]], Tuple[Sequence[Tuple[Union[numpy.ndarray, Tuple[numpy.ndarray, ...]], numpy.ndarray, float, float, Union[numpy.ndarray, Tuple[numpy.ndarray, ...]], Dict[str, Any]]], ...]], Dict[str, Any]][source]

Sample experiences for given indices from the replay buffer.

Parameters
  • indices (array-like) – list of array index to sample the data

  • num_steps (int) – Number of timesteps to sample. Should be greater than 0. Defaults to 1.

Returns

Experiences for the given indices. If num_steps is greater than 1, will return a tuple of size num_steps which contains the experiences for each entry.

info (Dict[str, Any]): dictionary of information about experiences.

Return type

experiences (Sequence[Experience] or Tuple[Sequence[Experience], …])

Raises

ValueError – If indices are empty or num_steps is 0 or negative.
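
The documented behavior (drop the oldest experience when full, return experiences together with an info dictionary, return a tuple of num_steps entries for multi-step sampling) can be sketched with the standard library. `ReplayBufferSketch` is a hypothetical stand-in, not the library class; the empty info dictionary and the field order used for the toy experiences are assumptions.

```python
import random
from collections import deque


class ReplayBufferSketch:
    """Hypothetical FIFO replay buffer with uniform sampling."""

    def __init__(self, capacity=None, seed=0):
        # deque(maxlen=...) silently drops the oldest item once it is full.
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def __len__(self):
        return len(self._buffer)

    def append(self, experience):
        self._buffer.append(experience)

    def sample(self, num_samples=1, num_steps=1):
        max_index = len(self._buffer) - num_steps + 1
        if num_steps < 1 or num_samples > max_index:
            raise ValueError("num_samples too large or num_steps not positive")
        indices = self._rng.sample(range(max_index), num_samples)
        return self.sample_indices(indices, num_steps)

    def sample_indices(self, indices, num_steps=1):
        if len(indices) == 0 or num_steps < 1:
            raise ValueError("indices are empty or num_steps is not positive")
        # One list of experiences per timestep offset.
        steps = tuple([self._buffer[i + step] for i in indices]
                      for step in range(num_steps))
        experiences = steps[0] if num_steps == 1 else steps
        return experiences, {}


buffer = ReplayBufferSketch(capacity=3)
for t in range(5):
    # Assumed field order: (state, action, reward, non_terminal, next_state, info).
    buffer.append((t, 0, 0.0, 1.0, t + 1, {}))

two_step, info = buffer.sample_indices([0, 1], num_steps=2)
sampled, _ = buffer.sample(num_samples=2)
```

After five appends with capacity 3, only the experiences for t = 2, 3, 4 remain in the buffer.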

List of ReplayBuffer

class nnabla_rl.replay_buffers.DecorableReplayBuffer(capacity, decor_fun)[source]

Bases: nnabla_rl.replay_buffer.ReplayBuffer

Buffer which can decorate the experience with external decoration function

This buffer enables decorating the experience before the item is used for building the batch. Decoration function will be called when __getitem__ is called. You can use this buffer to augment the data or add noise to the experience.
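
The decoration step can be sketched as a wrapper around item access. `DecorableBufferSketch` and `clip_reward` are hypothetical names used only for illustration; the sketch assumes the reward is the third field of an experience and shows that the stored item stays untouched while the retrieved one is decorated.

```python
class DecorableBufferSketch:
    """Hypothetical buffer that decorates experiences on retrieval."""

    def __init__(self, decor_fun):
        self._items = []
        self._decor_fun = decor_fun

    def append(self, experience):
        # Experiences are stored as given, without decoration.
        self._items.append(experience)

    def __getitem__(self, index):
        # Decoration happens lazily, each time an item is read.
        return self._decor_fun(self._items[index])


def clip_reward(experience):
    """Example decoration: clip the reward (assumed third field) to [-1, 1]."""
    s, a, r, *rest = experience
    return (s, a, max(-1.0, min(1.0, r)), *rest)


decorable = DecorableBufferSketch(decor_fun=clip_reward)
decorable.append(("s0", 1, 10.0, 1.0, "s1"))
decorated = decorable[0]
```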

class nnabla_rl.replay_buffers.HindsightReplayBuffer(reward_function: Callable[[numpy.ndarray, numpy.ndarray, Dict[str, Any]], Any], hindsight_prob: float = 0.8, capacity: Optional[int] = None)[source]

Bases: nnabla_rl.replay_buffer.ReplayBuffer

append(experience: Tuple[Union[numpy.ndarray, Tuple[numpy.ndarray, ...]], numpy.ndarray, float, float, Union[numpy.ndarray, Tuple[numpy.ndarray, ...]], Dict[str, Any]])[source]

Add new experience to the replay buffer.

Parameters

experience (array-like) – Experience includes transitions, such as state, action, reward, and whether the environment episode has finished or not. For more information, see the [Replay buffer documents](replay_buffer.md)

Notes

If the replay buffer is full, the oldest experience (head of the buffer) will be dropped and the given experience will be added to the tail of the buffer.

sample_indices(indices: Sequence[int], num_steps: int = 1) Tuple[Sequence[Tuple[Union[numpy.ndarray, Tuple[numpy.ndarray, ...]], numpy.ndarray, float, float, Union[numpy.ndarray, Tuple[numpy.ndarray, ...]], Dict[str, Any]]], Dict[str, Any]][source]

Sample experiences for given indices from the replay buffer.

Parameters
  • indices (array-like) – list of array index to sample the data

  • num_steps (int) – Number of timesteps to sample. Should be greater than 0. Defaults to 1.

Returns

Experiences for the given indices. If num_steps is greater than 1, will return a tuple of size num_steps which contains the experiences for each entry.

info (Dict[str, Any]): dictionary of information about experiences.

Return type

experiences (Sequence[Experience] or Tuple[Sequence[Experience], …])

Raises

ValueError – If indices are empty or num_steps is 0 or negative.

class nnabla_rl.replay_buffers.MemoryEfficientAtariBuffer(capacity: int)[source]

Bases: nnabla_rl.replay_buffer.ReplayBuffer

Buffer designed to compactly save experiences of Atari environments used in DQN. DQN (and other training algorithms) requires a large replay buffer when training on Atari games. If you save the experiences naively, you will need more than 100GB of memory (assuming 1M experiences), which usually does not fit in a machine's memory. This replay buffer reduces the size of each experience by casting the images to uint8 and removing old frames concatenated to the observation. By using this buffer, you can hold 1M experiences using only approximately 20GB of memory. Note that this class is designed only for DQN-style training on Atari environments (i.e. the state consists of 4 concatenated grayscale frames and its values are normalized between 0 and 1).

append(experience)[source]

Add new experience to the replay buffer.

Parameters

experience (array-like) – Experience includes transitions, such as state, action, reward, and whether the environment episode has finished or not. For more information, see the [Replay buffer documents](replay_buffer.md)

Notes

If the replay buffer is full, the oldest experience (head of the buffer) will be dropped and the given experience will be added to the tail of the buffer.

class nnabla_rl.replay_buffers.PrioritizedReplayBuffer(capacity: int, alpha: float = 0.6, beta: float = 0.4, betasteps: int = 10000, error_clip: Optional[Tuple[float, float]] = (- 1, 1), epsilon: float = 1e-08, reset_segment_interval: int = 1000, sort_interval: int = 1000000, variant: str = 'proportional')[source]

Bases: nnabla_rl.replay_buffer.ReplayBuffer

append(experience)[source]

Add new experience to the replay buffer.

Parameters

experience (array-like) – An experience holds transition data such as the state, action, reward, and whether the environment episode has finished. See the [Replay buffer documents](replay_buffer.md) for more information.

Notes

If the replay buffer is full, the oldest experience (at the head of the buffer) is dropped and the given experience is added to the tail of the buffer.

append_all(experiences)[source]

Add a list of experiences to the replay buffer.

Parameters

experiences (Sequence[Experience]) – Sequence of experiences to insert into the buffer.

Notes

If the replay buffer is full, the oldest experiences (at the head of the buffer) are dropped and the given experiences are added to the tail of the buffer.

property capacity

Capacity (maximum length) of this replay buffer, or None if no capacity is set.

sample(num_samples: int = 1, num_steps: int = 1)[source]

Randomly sample num_samples experiences from the replay buffer.

Parameters
  • num_samples (int) – Number of samples to sample from the replay buffer. Defaults to 1.

  • num_steps (int) – Number of timesteps to sample. Should be greater than 0. Defaults to 1.

Returns

num_samples randomly sampled experiences. If num_steps is greater than 1, returns a tuple of size num_steps in which each entry contains num_samples experiences.

info (Dict[str, Any]): Dictionary of information about the experiences.

Return type

experiences (Sequence[Experience] or Tuple[Sequence[Experience], …])

Raises

ValueError – If num_samples exceeds the maximum possible index or num_steps is 0 or negative.

Notes

Sampling strategy depends on the underlying implementation.
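
This section does not describe how `alpha` and `beta` from the class signature affect sampling. The sketch below assumes they play their usual roles in the standard proportional prioritized experience replay formulation: sampling probability is proportional to priority raised to `alpha`, and importance weights are `(N * P(i))` raised to `-beta`. It shows the general idea only. `proportional_sample` is a hypothetical helper, not the library's code, and normalizing the weights by the batch maximum is one common convention, chosen here as an assumption.

```python
import random
from typing import List, Optional, Sequence, Tuple


def proportional_sample(
    priorities: Sequence[float],
    num_samples: int,
    alpha: float = 0.6,
    beta: float = 0.4,
    rng: Optional[random.Random] = None,
) -> Tuple[List[int], List[float]]:
    """Hypothetical sketch of proportional prioritized sampling."""
    if num_samples <= 0:
        raise ValueError("num_samples must be greater than 0")
    if len(priorities) == 0 or any(p <= 0 for p in priorities):
        raise ValueError("priorities must be non-empty and positive")
    rng = rng or random.Random()

    # P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 gives uniform sampling.
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    probs = [s / total for s in scaled]

    indices = rng.choices(range(len(priorities)), weights=probs, k=num_samples)

    # Importance weights w_i = (N * P(i))^(-beta), normalized by the batch
    # maximum so that every weight is at most 1.
    n = len(priorities)
    raw = [(n * probs[i]) ** (-beta) for i in indices]
    max_raw = max(raw)
    weights = [w / max_raw for w in raw]
    return indices, weights


# Usage: experiences with larger priorities are drawn more often.
demo_indices, demo_weights = proportional_sample(
    [1.0, 2.0, 4.0], num_samples=5, rng=random.Random(0)
)
```

In this formulation the importance weights would typically be returned alongside the sampled experiences so that the training loss can correct for the non-uniform sampling. Whether the `info` dictionary above carries them is not stated in this section.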

sample_indices(indices: Sequence[int], num_steps: int = 1)[source]

Sample experiences for given indices from the replay buffer.

Parameters
  • indices (array-like) – Indices of the experiences to retrieve from the buffer.

  • num_steps (int) – Number of timesteps to sample. Should be greater than 0. Defaults to 1.

Returns

Experiences at the given indices. If num_steps is greater than 1, returns a tuple of size num_steps in which each entry holds the experiences for one timestep.

info (Dict[str, Any]): Dictionary of information about the experiences.

Return type

experiences (Sequence[Experience] or Tuple[Sequence[Experience], …])

Raises

ValueError – If indices are empty or num_steps is 0 or negative.

Indices and tables