NNabla RL¶
NNablaRL is a deep reinforcement learning library built on top of Neural Network Libraries that is intended to be used for research, development and production.
Getting started¶
Installation¶
Installing nnabla_rl is easy:
pip install nnabla_rl
If you would like to install nnabla_rl for development:
cd <nnabla_rl root dir>
pip install -e .
API documentation¶
NNabla RL APIs¶
Algorithms¶
All algorithms are derived from nnabla_rl.algorithm.Algorithm.
Note
Algorithms run on the CPU by default, regardless of any nnabla context set before instantiation. To run an algorithm on a GPU, set gpu_id through the algorithm’s config. Note that the algorithm overrides the nnabla context when training starts.
Algorithm¶
- class nnabla_rl.algorithm.AlgorithmConfig(gpu_id: int = - 1)[source]¶
List of configurations common to all algorithms
- Parameters
gpu_id (int) – id of the gpu to use. If negative, the training will run on cpu. Defaults to -1.
- class nnabla_rl.algorithm.Algorithm(env_info, config=AlgorithmConfig(gpu_id=- 1))[source]¶
Base Algorithm class
- Parameters
env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
config (AlgorithmConfig) – configuration of the algorithm
Note
Default functions, solvers and configurations are set to the configurations of each algorithm’s original paper. Default functions may not work depending on the environment.
- abstract compute_eval_action(state) → numpy.array[source]¶
Compute action for given state using current best policy. This is usually used for evaluation.
- Parameters
state (np.ndarray) – state to compute the action.
- Returns
Action for given state using current trained policy.
- Return type
np.ndarray
- property iteration_num: int¶
Current iteration number.
- Returns
Current iteration number of running training.
- Return type
int
- property latest_iteration_state: Dict[str, Any]¶
Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for getting more details.
- Returns
Dictionary with items of training process state.
- Return type
Dict[str, Any]
- property max_iterations: int¶
Maximum iteration number of running training.
- Returns
Maximum iteration number of running training.
- Return type
int
- set_hooks(hooks: Sequence[nnabla_rl.hook.Hook])[source]¶
Set hooks for running additional operation during training. Previously set hooks will be removed and replaced with new hooks.
- Parameters
hooks (list of nnabla_rl.hook.Hook) – Hooks to invoke during training
- train(env_or_buffer: Union[gym.core.Env, nnabla_rl.replay_buffer.ReplayBuffer], total_iterations: int)[source]¶
Train the policy with reinforcement learning algorithm
- Parameters
env_or_buffer (Union[gym.Env, ReplayBuffer]) – Target environment to train the policy online, or replay buffer to train the policy offline.
total_iterations (int) – Total number of iterations to train the policy.
- Raises
UnsupportedTrainingException – Raises if this algorithm does not support the training method for given parameter.
- train_offline(replay_buffer: gym.core.Env, total_iterations: int)[source]¶
Train the policy using only the replay buffer.
- Parameters
replay_buffer (ReplayBuffer) – Replay buffer to sample experiences to train the policy.
total_iterations (int) – Total number of iterations to train the policy.
- Raises
UnsupportedTrainingException – Raises if the algorithm does not support offline training
- train_online(train_env: gym.core.Env, total_iterations: int)[source]¶
Train the policy by interacting with given environment.
- Parameters
train_env (gym.Env) – Target environment to train the policy.
total_iterations (int) – Total number of iterations to train the policy.
- Raises
UnsupportedTrainingException – Raises if the algorithm does not support online training
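The relationship between train, train_online and train_offline described above can be pictured with a small stand-alone sketch. Every class below is a hypothetical stand-in written for illustration (none of them is imported from nnabla_rl), and dispatching on the argument type is an assumption about how such a method could be organized, not a statement about the library's internals.

```python
class UnsupportedTrainingException(Exception):
    """Stand-in for the exception named in the reference above."""


class ReplayBufferStub:
    """Hypothetical placeholder for a replay buffer."""


class EnvStub:
    """Hypothetical placeholder for a gym-style environment."""


class OnlineOnlyAlgorithm:
    """Toy algorithm that supports online training only."""

    def train(self, env_or_buffer, total_iterations):
        # Route to the matching training method based on the argument type.
        if isinstance(env_or_buffer, ReplayBufferStub):
            return self.train_offline(env_or_buffer, total_iterations)
        return self.train_online(env_or_buffer, total_iterations)

    def train_online(self, train_env, total_iterations):
        return f"online training for {total_iterations} iterations"

    def train_offline(self, replay_buffer, total_iterations):
        # An online-only algorithm rejects replay-buffer training.
        raise UnsupportedTrainingException("offline training is not supported")
```

With this toy class, passing an EnvStub reaches train_online, while passing a ReplayBufferStub raises UnsupportedTrainingException, mirroring the Raises entries above.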
A2C¶
- class nnabla_rl.algorithms.a2c.A2CConfig(gpu_id: int = - 1, gamma: float = 0.99, n_steps: int = 5, learning_rate: float = 0.0007, entropy_coefficient: float = 0.01, value_coefficient: float = 0.5, decay: float = 0.99, epsilon: float = 1e-05, start_timesteps: int = 1, actor_num: int = 8, timelimit_as_terminal: bool = False, max_grad_norm: Optional[float] = 0.5, seed: int = - 1)[source]¶
Bases:
nnabla_rl.algorithm.AlgorithmConfig
List of configurations for A2C algorithm
- Parameters
gamma (float) – discount factor of rewards. Defaults to 0.99.
n_steps (int) – number of rollout steps. Defaults to 5.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.0007.
entropy_coefficient (float) – scalar of entropy regularization term. Defaults to 0.01.
value_coefficient (float) – scalar of value loss. Defaults to 0.5.
decay (float) – decay parameter of Adam solver. Defaults to 0.99.
epsilon (float) – epsilon of Adam solver. Defaults to 0.00001.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 1.
actor_num (int) – number of parallel actors. Defaults to 8.
timelimit_as_terminal (bool) – Treat as done if the environment reaches the timelimit. Defaults to False.
max_grad_norm (float) – threshold value for clipping gradient. Defaults to 0.5.
seed (int) – base seed of random number generator used by the actors. Defaults to -1.
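To make the roles of gamma and n_steps concrete, here is a minimal sketch of the discounted return over one rollout, bootstrapped from a value estimate after the last step. The function name and the bootstrap_value argument are illustrative assumptions, not part of the nnabla_rl API, and the sketch ignores episode termination inside the rollout.

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Discounted returns for one rollout of len(rewards) steps.

    bootstrap_value is the value estimate of the state reached after the
    last step (an assumed input for this sketch).
    """
    returns = []
    running = bootstrap_value
    # Walk the rollout backwards: R_t = r_t + gamma * R_{t+1}.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# A rollout of n_steps = 5 rewards; gamma = 0.5 keeps the arithmetic easy to follow.
rollout_rewards = [1.0, 0.0, 0.0, 1.0, 0.0]
returns = n_step_returns(rollout_rewards, bootstrap_value=2.0, gamma=0.5)
```

Subtracting the value estimate of each state from these returns would give an advantage estimate of the kind used by actor-critic methods.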
- class nnabla_rl.algorithms.a2c.A2C(env_or_env_info, v_function_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.v_function.VFunction] = <nnabla_rl.algorithms.a2c.DefaultVFunctionBuilder object>, v_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.a2c.DefaultSolverBuilder object>, policy_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.a2c.DefaultPolicyBuilder object>, policy_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.a2c.DefaultSolverBuilder object>, config=A2CConfig(gpu_id=-1, gamma=0.99, n_steps=5, learning_rate=0.0007, entropy_coefficient=0.01, value_coefficient=0.5, decay=0.99, epsilon=1e-05, start_timesteps=1, actor_num=8, timelimit_as_terminal=False, max_grad_norm=0.5, seed=-1))[source]¶
Bases:
nnabla_rl.algorithm.Algorithm
Advantage Actor-Critic (A2C) algorithm implementation.
This class implements the Advantage Actor-Critic (A2C) algorithm. A2C is the synchronous version of A3C (Asynchronous Advantage Actor-Critic). A3C was proposed by V. Mnih, et al. in the paper: “Asynchronous Methods for Deep Reinforcement Learning”. For details see: https://arxiv.org/abs/1602.01783
This algorithm only supports online training.
- Parameters
env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
v_function_builder (ModelBuilder[VFunction]) – builder of v function models
v_solver_builder (SolverBuilder) – builder for v function solvers
policy_builder (ModelBuilder[StochasticPolicy]) – builder of policy models
policy_solver_builder (SolverBuilder) – builder for policy solvers
config (A2CConfig) – configuration of A2C algorithm
- property latest_iteration_state¶
Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for getting more details.
- Returns
Dictionary with items of training process state.
- Return type
Dict[str, Any]
BCQ¶
- class nnabla_rl.algorithms.bcq.BCQConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 0.001, batch_size: int = 100, tau: float = 0.005, lmb: float = 0.75, phi: float = 0.05, num_q_ensembles: int = 2, num_action_samples: int = 10)[source]¶
Bases:
nnabla_rl.algorithm.AlgorithmConfig
List of configurations for BCQ algorithm
- Parameters
gamma (float) – discount factor of reward. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.001.
batch_size (int) – training batch size. Defaults to 100.
tau (float) – target network’s parameter update coefficient. Defaults to 0.005.
lmb (float) – weight \(\lambda\) used for balancing the ratio between \(\min{Q}\) and \(\max{Q}\) on target q value generation (i.e. \(\lambda\min{Q} + (1 - \lambda)\max{Q}\)). Defaults to 0.75.
phi (float) – action perturbator noise coefficient. Defaults to 0.05.
num_q_ensembles (int) – number of q function ensembles. Defaults to 2.
num_action_samples (int) – number of actions to sample for computing target q values. Defaults to 10.
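The lmb weighting described above can be illustrated with a short sketch. It assumes one list of ensemble Q-values per sampled action (num_q_ensembles values for each of num_action_samples actions) and takes the best weighted value over the sampled actions, following the BCQ paper. The function names are illustrative and not part of nnabla_rl.

```python
def weighted_min_max_q(ensemble_q_values, lmb=0.75):
    """Return lmb * min(Q) + (1 - lmb) * max(Q) over the Q-function ensemble."""
    return lmb * min(ensemble_q_values) + (1.0 - lmb) * max(ensemble_q_values)


def target_q_over_samples(q_per_action, lmb=0.75):
    """Pick the best weighted value among the sampled actions.

    q_per_action holds one list of ensemble Q-values per sampled action.
    """
    return max(weighted_min_max_q(q_values, lmb) for q_values in q_per_action)


# Two sampled actions, each scored by an ensemble of two Q-functions.
sampled_q = [[1.0, 3.0], [2.0, 2.5]]
target = target_q_over_samples(sampled_q, lmb=0.75)
```

With lmb = 1.0 the target reduces to the pessimistic minimum over the ensemble, and with lmb = 0.0 to the optimistic maximum.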
- class nnabla_rl.algorithms.bcq.BCQ(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.bcq.BCQConfig = BCQConfig(gpu_id=-1, gamma=0.99, learning_rate=0.001, batch_size=100, tau=0.005, lmb=0.75, phi=0.05, num_q_ensembles=2, num_action_samples=10), q_function_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.bcq.DefaultQFunctionBuilder object>, q_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.bcq.DefaultSolverBuilder object>, vae_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.encoder.VariationalAutoEncoder] = <nnabla_rl.algorithms.bcq.DefaultVAEBuilder object>, vae_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.bcq.DefaultSolverBuilder object>, perturbator_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.perturbator.Perturbator] = <nnabla_rl.algorithms.bcq.DefaultPerturbatorBuilder object>, perturbator_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.bcq.DefaultSolverBuilder object>)[source]¶
Bases:
nnabla_rl.algorithm.Algorithm
Batch-Constrained Q-learning (BCQ) algorithm
This class implements the Batch-Constrained Q-learning (BCQ) algorithm proposed by S. Fujimoto, et al. in the paper: “Off-Policy Deep Reinforcement Learning without Exploration”. For details see: https://arxiv.org/abs/1812.02900
This algorithm only supports offline training.
- Parameters
env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (BCQConfig) – configuration of the BCQ algorithm
q_function_builder (ModelBuilder[QFunction]) – builder of q-function models
q_solver_builder (SolverBuilder) – builder for q-function solvers
vae_builder (ModelBuilder[VariationalAutoEncoder]) – builder of variational auto encoder models
vae_solver_builder (SolverBuilder) – builder for variational auto encoder solvers
perturbator_builder (ModelBuilder[Perturbator]) – builder of perturbator models
perturbator_solver_builder (SolverBuilder) – builder for perturbator solvers
- property latest_iteration_state¶
Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for getting more details.
- Returns
Dictionary with items of training process state.
- Return type
Dict[str, Any]
BEAR¶
- class nnabla_rl.algorithms.bear.BEARConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 0.001, batch_size: int = 100, tau: float = 0.005, lmb: float = 0.75, epsilon: float = 0.05, num_q_ensembles: int = 2, num_mmd_actions: int = 5, num_action_samples: int = 10, mmd_type: str = 'gaussian', mmd_sigma: float = 20.0, initial_lagrange_multiplier: Optional[float] = None, fix_lagrange_multiplier: bool = False, warmup_iterations: int = 20000, use_mean_for_eval: bool = False)[source]¶
Bases:
nnabla_rl.algorithm.AlgorithmConfig
List of configurations for BEAR algorithm.
- Parameters
gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.001.
batch_size (int) – training batch size. Defaults to 100.
tau (float) – target network’s parameter update coefficient. Defaults to 0.005.
lmb (float) – weight \(\lambda\) used for balancing the ratio between \(\min{Q}\) and \(\max{Q}\) on target q value generation (i.e. \(\lambda\min{Q} + (1 - \lambda)\max{Q}\)). Defaults to 0.75.
epsilon (float) – inequality constraint of dual gradient descent. Defaults to 0.05.
num_q_ensembles (int) – number of q ensembles. Defaults to 2.
num_mmd_actions (int) – number of actions to sample for computing maximum mean discrepancy (MMD). Defaults to 5.
num_action_samples (int) – number of actions to sample for computing target q values. Defaults to 10.
mmd_type (str) – kernel type used for MMD computation. laplacian or gaussian is supported. Defaults to gaussian.
mmd_sigma (float) – parameter used for adjusting the MMD. Defaults to 20.0.
initial_lagrange_multiplier (float, optional) – Initial value of lagrange multiplier. If not specified, random value sampled from normal distribution will be used instead.
fix_lagrange_multiplier (bool) – Whether to fix the lagrange multiplier or not. Defaults to False.
warmup_iterations (int) – Number of iterations before policy updates start. Defaults to 20000.
use_mean_for_eval (bool) – Use mean value instead of best action among the samples for evaluation. Defaults to False.
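As a rough illustration of what mmd_type and mmd_sigma control, the sketch below computes a (biased) squared maximum mean discrepancy between two sets of action samples. The kernel scaling used here, exp(-d / (2 * sigma)) with a squared Euclidean distance for gaussian and an L1 distance for laplacian, is an assumption made for this example; the library's exact convention may differ.

```python
import math


def kernel(x, y, sigma=20.0, mmd_type="gaussian"):
    """Kernel value between two action vectors (scaling is an assumed convention)."""
    if mmd_type == "gaussian":
        distance = sum((a - b) ** 2 for a, b in zip(x, y))
    elif mmd_type == "laplacian":
        distance = sum(abs(a - b) for a, b in zip(x, y))
    else:
        raise ValueError(f"unsupported mmd_type: {mmd_type}")
    return math.exp(-distance / (2.0 * sigma))


def mmd_squared(xs, ys, sigma=20.0, mmd_type="gaussian"):
    """Biased estimate of MMD^2: E[k(x, x')] - 2 E[k(x, y)] + E[k(y, y')]."""

    def mean_kernel(a_samples, b_samples):
        total = sum(kernel(a, b, sigma, mmd_type) for a in a_samples for b in b_samples)
        return total / (len(a_samples) * len(b_samples))

    return mean_kernel(xs, xs) - 2.0 * mean_kernel(xs, ys) + mean_kernel(ys, ys)


policy_actions = [[0.0, 0.0], [1.0, 0.0]]
dataset_actions = [[3.0, 3.0], [4.0, 3.0]]
gap = mmd_squared(policy_actions, dataset_actions, sigma=20.0, mmd_type="gaussian")
```

The estimate is zero when both sample sets are identical and grows as the two sets move apart, which is the quantity a constraint such as epsilon would bound.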
- class nnabla_rl.algorithms.bear.BEAR(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.bear.BEARConfig = BEARConfig(gpu_id=-1, gamma=0.99, learning_rate=0.001, batch_size=100, tau=0.005, lmb=0.75, epsilon=0.05, num_q_ensembles=2, num_mmd_actions=5, num_action_samples=10, mmd_type='gaussian', mmd_sigma=20.0, initial_lagrange_multiplier=None, fix_lagrange_multiplier=False, warmup_iterations=20000, use_mean_for_eval=False), q_function_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.bear.DefaultQFunctionBuilder object>, q_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.bear.DefaultSolverBuilder object>, pi_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.bear.DefaultPolicyBuilder object>, pi_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.bear.DefaultSolverBuilder object>, vae_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.encoder.VariationalAutoEncoder] = <nnabla_rl.algorithms.bear.DefaultVAEBuilder object>, vae_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.bear.DefaultSolverBuilder object>, lagrange_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.bear.DefaultSolverBuilder object>)[source]¶
Bases:
nnabla_rl.algorithm.Algorithm
Bootstrapping Error Accumulation Reduction (BEAR) algorithm.
This class implements the Bootstrapping Error Accumulation Reduction (BEAR) algorithm proposed by A. Kumar, et al. in the paper: “Stabilizing Off-Policy Q-learning via Bootstrapping Error Reduction”. For details see: https://arxiv.org/abs/1906.00949
This algorithm only supports offline training.
- Parameters
env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (BEARConfig) – configuration of the BEAR algorithm
q_function_builder (ModelBuilder[QFunction]) – builder of q-function models
q_solver_builder (SolverBuilder) – builder for q-function solvers
pi_builder (ModelBuilder[StochasticPolicy]) – builder of policy models
pi_solver_builder (SolverBuilder) – builder for policy solvers
vae_builder (ModelBuilder[VariationalAutoEncoder]) – builder of variational auto encoder models
vae_solver_builder (SolverBuilder) – builder for variational auto encoder solvers
lagrange_solver_builder (SolverBuilder) – builder for lagrange multiplier solver
- property latest_iteration_state¶
Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for getting more details.
- Returns
Dictionary with items of training process state.
- Return type
Dict[str, Any]
Categorical DQN¶
- class nnabla_rl.algorithms.categorical_dqn.CategoricalDQNConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 0.00025, batch_size: int = 32, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, learner_update_frequency: int = 4, target_update_frequency: int = 10000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.01, test_epsilon: float = 0.001, v_min: float = - 10.0, v_max: float = 10.0, num_atoms: int = 51)[source]¶
Bases:
nnabla_rl.algorithm.AlgorithmConfig
List of configurations for CategoricalDQN algorithm.
- Parameters
gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.00025.
batch_size (int) – training batch size. Defaults to 32.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 50000.
replay_buffer_size (int) – the capacity of replay buffer. Defaults to 1000000.
learner_update_frequency (int) – the interval of learner update. Defaults to 4.
target_update_frequency (int) – the interval of target q-function update. Defaults to 10000.
max_explore_steps (int) – the number of steps decaying the epsilon value. The epsilon will be decayed linearly \(\epsilon=\epsilon_{init} - step\times\frac{\epsilon_{init} - \epsilon_{final}}{max\_explore\_steps}\). Defaults to 1000000.
initial_epsilon (float) – the initial epsilon value for ε-greedy explorer. Defaults to 1.0.
final_epsilon (float) – the last epsilon value for ε-greedy explorer. Defaults to 0.01.
test_epsilon (float) – the epsilon value on testing. Defaults to 0.001.
v_min (float) – lower limit of the value used in value distribution function. Defaults to -10.0.
v_max (float) – upper limit of the value used in value distribution function. Defaults to 10.0.
num_atoms (int) – the number of bins used in value distribution function. Defaults to 51.
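The v_min, v_max and num_atoms settings define the fixed support of the value distribution. The sketch below follows the construction in the Categorical DQN paper (evenly spaced atoms, with the Q-value taken as the expectation over them). The function names are illustrative and not part of nnabla_rl.

```python
def atom_support(v_min=-10.0, v_max=10.0, num_atoms=51):
    """Evenly spaced atoms z_i = v_min + i * delta_z, for i = 0 .. num_atoms - 1."""
    delta_z = (v_max - v_min) / (num_atoms - 1)
    return [v_min + i * delta_z for i in range(num_atoms)]


def expected_value(probabilities, support):
    """Q-value as the expectation of the categorical value distribution."""
    return sum(p * z for p, z in zip(probabilities, support))


support = atom_support()
uniform = [1.0 / len(support)] * len(support)
q_value = expected_value(uniform, support)
```

With the default settings the atoms are spaced 0.4 apart, and a uniform distribution over a support that is symmetric around zero has an expected value of zero (up to floating-point error).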
- class nnabla_rl.algorithms.categorical_dqn.CategoricalDQN(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.categorical_dqn.CategoricalDQNConfig = CategoricalDQNConfig(gpu_id=-1, gamma=0.99, learning_rate=0.00025, batch_size=32, start_timesteps=50000, replay_buffer_size=1000000, learner_update_frequency=4, target_update_frequency=10000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.01, test_epsilon=0.001, v_min=-10.0, v_max=10.0, num_atoms=51), value_distribution_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.distributional_function.ValueDistributionFunction] = <nnabla_rl.algorithms.categorical_dqn.DefaultValueDistFunctionBuilder object>, value_distribution_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.categorical_dqn.DefaultSolverBuilder object>, replay_buffer_builder: nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.categorical_dqn.DefaultReplayBufferBuilder object>)[source]¶
Bases:
nnabla_rl.algorithm.Algorithm
Categorical DQN algorithm.
This class implements the Categorical DQN algorithm proposed by M. Bellemare, et al. in the paper: “A Distributional Perspective on Reinforcement Learning”. For details see: https://arxiv.org/abs/1707.06887
- Parameters
env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (CategoricalDQNConfig) – configuration of the CategoricalDQN algorithm
value_distribution_builder (ModelBuilder[ValueDistributionFunction]) – builder of value distribution function models
value_distribution_solver_builder (SolverBuilder) – builder of value distribution function solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
- property latest_iteration_state¶
Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for getting more details.
- Returns
Dictionary with items of training process state.
- Return type
Dict[str, Any]
DDPG¶
- class nnabla_rl.algorithms.ddpg.DDPGConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 0.001, batch_size: int = 100, tau: float = 0.005, start_timesteps: int = 10000, replay_buffer_size: int = 1000000, exploration_noise_sigma: float = 0.1)[source]¶
Bases:
nnabla_rl.algorithm.AlgorithmConfig
List of configurations for DDPG algorithm
- Parameters
gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.001.
batch_size (int) – training batch size. Defaults to 100.
tau (float) – target network’s parameter update coefficient. Defaults to 0.005.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 10000.
replay_buffer_size (int) – capacity of the replay buffer. Defaults to 1000000.
exploration_noise_sigma (float) – standard deviation of gaussian exploration noise. Defaults to 0.1.
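The tau parameter is commonly used in a soft (Polyak) target update of the form target = tau * source + (1 - tau) * target. The sketch below assumes that convention and represents parameters as plain lists of floats. It is an illustration of the update rule, not the library's implementation.

```python
def soft_update(target_params, source_params, tau=0.005):
    """Move each target parameter a fraction tau towards its source parameter."""
    return [
        tau * source + (1.0 - tau) * target
        for target, source in zip(target_params, source_params)
    ]


target_network = [0.0, 1.0]
trained_network = [1.0, 1.0]
updated = soft_update(target_network, trained_network, tau=0.5)
```

A small tau such as the default 0.005 makes the target network track the trained network slowly, while tau = 1.0 would copy the trained parameters outright.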
- class nnabla_rl.algorithms.ddpg.DDPG(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.ddpg.DDPGConfig = DDPGConfig(gpu_id=-1, gamma=0.99, learning_rate=0.001, batch_size=100, tau=0.005, start_timesteps=10000, replay_buffer_size=1000000, exploration_noise_sigma=0.1), critic_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.ddpg.DefaultCriticBuilder object>, critic_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.ddpg.DefaultSolverBuilder object>, actor_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.policy.DeterministicPolicy] = <nnabla_rl.algorithms.ddpg.DefaultActorBuilder object>, actor_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.ddpg.DefaultSolverBuilder object>, replay_buffer_builder: nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.ddpg.DefaultReplayBufferBuilder object>)[source]¶
Bases:
nnabla_rl.algorithm.Algorithm
Deep Deterministic Policy Gradient (DDPG) algorithm.
This class implements a modified version of the Deep Deterministic Policy Gradient (DDPG) algorithm proposed by T. P. Lillicrap, et al. in the paper: “Continuous control with deep reinforcement learning”. For details see: https://arxiv.org/abs/1509.02971 We use gaussian noise instead of an Ornstein-Uhlenbeck process to explore the environment. The effectiveness of using gaussian noise for DDPG is reported in the paper: “Addressing Function Approximation Error in Actor-Critic Methods”. See https://arxiv.org/abs/1802.09477
- Parameters
env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (DDPGConfig) – configuration of the DDPG algorithm
critic_builder (ModelBuilder[QFunction]) – builder of critic models
critic_solver_builder (SolverBuilder) – builder of critic solvers
actor_builder (ModelBuilder[DeterministicPolicy]) – builder of actor models
actor_solver_builder (SolverBuilder) – builder of actor solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
- compute_eval_action(state)[source]¶
Compute action for given state using current best policy. This is usually used for evaluation.
- Parameters
state (np.ndarray) – state to compute the action.
- Returns
Action for given state using current trained policy.
- Return type
np.ndarray
- property latest_iteration_state¶
Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for getting more details.
- Returns
Dictionary with items of training process state.
- Return type
Dict[str, Any]
DQN¶
- class nnabla_rl.algorithms.dqn.DQNConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 0.00025, batch_size: int = 32, learner_update_frequency: float = 4, target_update_frequency: float = 10000, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.1, test_epsilon: float = 0.05, grad_clip: Optional[Tuple[float, float]] = (- 1.0, 1.0))[source]¶
Bases:
nnabla_rl.algorithm.AlgorithmConfig
List of configurations for DQN algorithm
- Parameters
gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.00025.
batch_size (int) – training batch size. Defaults to 32.
learner_update_frequency (int) – the interval of learner update. Defaults to 4.
target_update_frequency (int) – the interval of target q-function update. Defaults to 10000.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 50000.
replay_buffer_size (int) – the capacity of replay buffer. Defaults to 1000000.
max_explore_steps (int) – the number of steps decaying the epsilon value. The epsilon will be decayed linearly \(\epsilon=\epsilon_{init} - step\times\frac{\epsilon_{init} - \epsilon_{final}}{max\_explore\_steps}\). Defaults to 1000000.
initial_epsilon (float) – the initial epsilon value for ε-greedy explorer. Defaults to 1.0.
final_epsilon (float) – the last epsilon value for ε-greedy explorer. Defaults to 0.1.
test_epsilon (float) – the epsilon value on testing. Defaults to 0.05.
grad_clip (Optional[Tuple[float, float]]) – Clip the gradient of final layer. Defaults to (-1.0, 1.0).
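The linear decay formula given for max_explore_steps can be written out directly. The sketch below uses the DQN defaults listed above. Holding epsilon at final_epsilon once max_explore_steps is reached is an assumption of this example, and the function name is illustrative.

```python
def linear_epsilon(step, initial_epsilon=1.0, final_epsilon=0.1, max_explore_steps=1_000_000):
    """epsilon = initial - step * (initial - final) / max_explore_steps."""
    # Clamp the step so epsilon stays at final_epsilon after the decay period (assumed).
    clipped_step = min(max(step, 0), max_explore_steps)
    slope = (initial_epsilon - final_epsilon) / max_explore_steps
    return initial_epsilon - clipped_step * slope


epsilon_at_start = linear_epsilon(0)
epsilon_halfway = linear_epsilon(500_000)
epsilon_after_decay = linear_epsilon(2_000_000)
```

Halfway through the decay period epsilon sits midway between the initial and final values, that is 0.55 with the defaults.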
- class nnabla_rl.algorithms.dqn.DQN(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.dqn.DQNConfig = DQNConfig(gpu_id=-1, gamma=0.99, learning_rate=0.00025, batch_size=32, learner_update_frequency=4, target_update_frequency=10000, start_timesteps=50000, replay_buffer_size=1000000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.1, test_epsilon=0.05, grad_clip=(-1.0, 1.0)), q_func_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.dqn.DefaultQFunctionBuilder object>, q_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.dqn.DefaultSolverBuilder object>, replay_buffer_builder: nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.dqn.DefaultReplayBufferBuilder object>)[source]¶
Bases:
nnabla_rl.algorithm.Algorithm
DQN algorithm.
This class implements the Deep Q-Network (DQN) algorithm proposed by V. Mnih, et al. in the paper: “Human-level control through deep reinforcement learning”. For details see: https://www.nature.com/articles/nature14236
Note that the default solver used in this implementation is RMSPropGraves, as in the original paper. However, in practical applications, we recommend using Adam as the optimizer of DQN. You can replace the solver by implementing a SolverBuilder and passing it when instantiating the DQN class.
- Parameters
env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (DQNConfig) – the parameter for DQN training
q_func_builder (ModelBuilder) – builder of q function model
q_solver_builder (SolverBuilder) – builder of q function solver
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
- property latest_iteration_state¶
Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for getting more details.
- Returns
Dictionary with items of training process state.
- Return type
Dict[str, Any]
GAIL¶
- class nnabla_rl.algorithms.gail.GAILConfig(gpu_id: int = - 1, preprocess_state: bool = True, act_deterministic_in_eval: bool = True, discriminator_batch_size: int = 50000, discriminator_learning_rate: float = 0.01, discriminator_update_frequency: int = 1, adversary_entropy_coef: float = 0.001, policy_update_frequency: int = 1, gamma: float = 0.995, lmb: float = 0.97, pi_batch_size: int = 50000, num_steps_per_iteration: int = 50000, sigma_kl_divergence_constraint: float = 0.01, maximum_backtrack_numbers: int = 10, conjugate_gradient_damping: float = 0.1, conjugate_gradient_iterations: int = 10, vf_epochs: int = 5, vf_batch_size: int = 128, vf_learning_rate: float = 0.001)[source]¶
Bases:
nnabla_rl.algorithm.AlgorithmConfig
GAIL config
- Parameters
act_deterministic_in_eval (bool) – Enable acting deterministically at evaluation. Defaults to True.
discriminator_batch_size (int) – Training batch size of the discriminator. Usually, discriminator_batch_size is the same as pi_batch_size. Defaults to 50000.
discriminator_learning_rate (float) – Learning rate which is set to the solvers of the discriminator function. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.01.
discriminator_update_frequency (int) – Frequency (measured in the number of parameter updates) of discriminator updates. Defaults to 1.
adversary_entropy_coef (float) – Coefficient of the entropy loss in discriminator training. Defaults to 0.001.
policy_update_frequency (int) – Frequency (measured in the number of parameter updates) of policy updates. Defaults to 1.
gamma (float) – Discount factor of rewards. Defaults to 0.995.
lmb (float) – Scalar of lambda return’s computation in GAE. This configuration controls the bias-variance trade-off of the estimated value. If it is close to 0, the estimated value is low-variance but biased. If it is close to 1, the estimated value is unbiased but high-variance. Defaults to 0.97.
num_steps_per_iteration (int) – Number of steps per training iteration for collecting on-policy experiences. Increasing this value makes the policy and value function updates more precise, but the computation time of each iteration will increase. Defaults to 50000.
pi_batch_size (int) – Training batch size of the policy. Usually, pi_batch_size is the same as num_steps_per_iteration. Defaults to 50000.
sigma_kl_divergence_constraint (float) – Constraint size of the KL divergence between the previous policy and the updated policy. Defaults to 0.01.
maximum_backtrack_numbers (int) – Maximum number of backtracks in the line search. Defaults to 10.
conjugate_gradient_damping (float) – Damping size of the conjugate gradient method. Defaults to 0.1.
conjugate_gradient_iterations (int) – Number of iterations of the conjugate gradient method. Defaults to 10.
vf_epochs (int) – Number of epochs in each iteration. Defaults to 5.
vf_batch_size (int) – Training batch size of the value function. Defaults to 128.
vf_learning_rate (float) – Learning rate which is set to the solvers of the value function. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.001.
preprocess_state (bool) – Enable preprocessing the states in the collected experiences before feeding them as a training batch. Defaults to True.
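To make the role of gamma and lmb in the GAIL config more concrete, here is a minimal sketch of a GAE advantage computation for a single trajectory in plain Python. The function name `compute_gae` and the toy rewards and values are hypothetical, and this is not nnabla_rl's own implementation.

```python
def compute_gae(rewards, values, last_value, gamma=0.995, lmb=0.97):
    """Return GAE advantages for one trajectory (illustrative sketch).

    rewards: r_0 .. r_{T-1}
    values: V(s_0) .. V(s_{T-1})
    last_value: V(s_T); pass 0.0 if the episode terminated.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    # Walk backwards so each step can reuse the advantage of the next step.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        running = delta + gamma * lmb * running
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
# lmb=0 keeps only the one-step TD error (low variance, biased by V).
one_step = compute_gae(rewards, values, 0.0, gamma=1.0, lmb=0.0)
# lmb=1 recovers the Monte Carlo return minus V (unbiased, high variance).
monte_carlo = compute_gae(rewards, values, 0.0, gamma=1.0, lmb=1.0)
```

Values of lmb between these two extremes interpolate between the one-step and the Monte Carlo estimates.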
- class nnabla_rl.algorithms.gail.GAIL(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], expert_buffer: nnabla_rl.replay_buffer.ReplayBuffer, config: nnabla_rl.algorithms.gail.GAILConfig = GAILConfig(gpu_id=-1, preprocess_state=True, act_deterministic_in_eval=True, discriminator_batch_size=50000, discriminator_learning_rate=0.01, discriminator_update_frequency=1, adversary_entropy_coef=0.001, policy_update_frequency=1, gamma=0.995, lmb=0.97, pi_batch_size=50000, num_steps_per_iteration=50000, sigma_kl_divergence_constraint=0.01, maximum_backtrack_numbers=10, conjugate_gradient_damping=0.1, conjugate_gradient_iterations=10, vf_epochs=5, vf_batch_size=128, vf_learning_rate=0.001), v_function_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.v_function.VFunction] = <nnabla_rl.algorithms.gail.DefaultVFunctionBuilder object>, v_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.gail.DefaultVFunctionSolverBuilder object>, policy_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.gail.DefaultPolicyBuilder object>, reward_function_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.reward_function.RewardFunction] = <nnabla_rl.algorithms.gail.DefaultRewardFunctionBuilder object>, reward_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.gail.DefaultRewardFunctionSolverBuilder object>, state_preprocessor_builder: Optional[nnabla_rl.builders.preprocessor_builder.PreprocessorBuilder] = <nnabla_rl.algorithms.gail.DefaultPreprocessorBuilder object>)[source]¶
Bases:
nnabla_rl.algorithm.Algorithm
Generative Adversarial Imitation Learning implementation.
This class implements the Generative Adversarial Imitation Learning (GAIL) algorithm proposed by Jonathan Ho, et al. in the paper: “Generative Adversarial Imitation Learning”. For details see: https://arxiv.org/abs/1606.03476
This algorithm only supports online training.
- Parameters
env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
expert_buffer (ReplayBuffer) – replay buffer which contains expert experience
config (GAILConfig) – configuration of GAIL algorithm
v_function_builder (ModelBuilder[VFunction]) – builder of v function models
v_solver_builder (SolverBuilder) – builder for v function solvers
policy_builder (ModelBuilder[StochasticPolicy]) – builder of policy models
reward_function_builder (ModelBuilder[RewardFunction]) – builder of reward function models
reward_solver_builder (SolverBuilder) – builder for reward function solvers
state_preprocessor_builder (None or PreprocessorBuilder) – state preprocessor builder to preprocess the states
- property latest_iteration_state¶
Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for getting more details.
- Returns
Dictionary with items of training process state.
- Return type
Dict[str, Any]
IQN¶
- class nnabla_rl.algorithms.iqn.IQNConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 5e-05, batch_size: int = 32, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, learner_update_frequency: int = 4, target_update_frequency: int = 10000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.01, test_epsilon: float = 0.001, N: int = 64, N_prime: int = 64, K: int = 32, kappa: float = 1.0, embedding_dim: int = 64)[source]¶
Bases:
nnabla_rl.algorithm.AlgorithmConfig
List of configurations for IQN algorithm
- Parameters
gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.00005.
batch_size (int) – training batch size. Defaults to 32.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 50000.
replay_buffer_size (int) – the capacity of replay buffer. Defaults to 1000000.
learner_update_frequency (int) – the interval of learner update. Defaults to 4.
target_update_frequency (int) – the interval of target q-function update. Defaults to 10000.
max_explore_steps (int) – the number of steps decaying the epsilon value. The epsilon will be decayed linearly \(\epsilon=\epsilon_{init} - step\times\frac{\epsilon_{init} - \epsilon_{final}}{max\_explore\_steps}\). Defaults to 1000000.
initial_epsilon (float) – the initial epsilon value for ε-greedy explorer. Defaults to 1.0.
final_epsilon (float) – the last epsilon value for ε-greedy explorer. Defaults to 0.01.
test_epsilon (float) – the epsilon value on testing. Defaults to 0.001.
N (int) – Number of samples to compute the current state’s quantile values. Defaults to 64.
N_prime (int) – Number of samples to compute the target state’s quantile values. Defaults to 64.
K (int) – Number of samples to compute greedy next action. Defaults to 32.
kappa (float) – threshold value of quantile huber loss. Defaults to 1.0.
embedding_dim (int) – dimension of embedding for the sample point. Defaults to 64.
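The linear decay formula given for max_explore_steps can be sketched in a few lines of Python. The function name `linear_epsilon` is hypothetical, and clamping the step so that epsilon stays at final_epsilon after max_explore_steps is an assumption of this sketch rather than something stated above.

```python
def linear_epsilon(step, initial_epsilon=1.0, final_epsilon=0.01,
                   max_explore_steps=1000000):
    """Linearly decayed epsilon for an epsilon-greedy explorer (sketch)."""
    # Assumption: hold epsilon at final_epsilon once the decay has finished.
    step = min(max(step, 0), max_explore_steps)
    slope = (initial_epsilon - final_epsilon) / max_explore_steps
    return initial_epsilon - step * slope


eps_start = linear_epsilon(0)
eps_half = linear_epsilon(500000)
eps_after = linear_epsilon(2000000)
```

With the default values, epsilon starts at 1.0, is roughly 0.505 halfway through the decay, and ends at about 0.01.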
- class nnabla_rl.algorithms.iqn.IQN(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.iqn.IQNConfig = IQNConfig(gpu_id=-1, gamma=0.99, learning_rate=5e-05, batch_size=32, start_timesteps=50000, replay_buffer_size=1000000, learner_update_frequency=4, target_update_frequency=10000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.01, test_epsilon=0.001, N=64, N_prime=64, K=32, kappa=1.0, embedding_dim=64), quantile_function_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.distributional_function.StateActionQuantileFunction] = <nnabla_rl.algorithms.iqn.DefaultQuantileFunctionBuilder object>, quantile_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.iqn.DefaultSolverBuilder object>, replay_buffer_builder: nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.iqn.DefaultReplayBufferBuilder object>)[source]¶
Bases:
nnabla_rl.algorithm.Algorithm
Implicit Quantile Network algorithm.
This class implements the Implicit Quantile Network (IQN) algorithm proposed by W. Dabney, et al. in the paper: “Implicit Quantile Networks for Distributional Reinforcement Learning” For details see: https://arxiv.org/pdf/1806.06923.pdf
- Parameters
env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (IQNConfig) – configuration of IQN algorithm
quantile_function_builder (ModelBuilder[StateActionQuantileFunction]) – builder of state-action quantile function models
quantile_solver_builder (SolverBuilder) – builder for state-action quantile function solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
- property latest_iteration_state¶
Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for getting more details.
- Returns
Dictionary with items of training process state.
- Return type
Dict[str, Any]
Munchausen DQN¶
- class nnabla_rl.algorithms.munchausen_dqn.MunchausenDQNConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 5e-05, batch_size: int = 32, learner_update_frequency: float = 4, target_update_frequency: float = 10000, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.01, test_epsilon: float = 0.001, entropy_temperature: float = 0.03, munchausen_scaling_term: float = 0.9, clipping_value: float = - 1)[source]¶
Bases:
nnabla_rl.algorithm.AlgorithmConfig
List of configurations for Munchausen DQN algorithm
- Parameters
gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.00005.
batch_size (int) – training batch size. Defaults to 32.
learner_update_frequency (int) – the interval of learner update. Defaults to 4.
target_update_frequency (int) – the interval of target q-function update. Defaults to 10000.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 50000.
replay_buffer_size (int) – the capacity of replay buffer. Defaults to 1000000.
max_explore_steps (int) – the number of steps decaying the epsilon value. The epsilon will be decayed linearly \(\epsilon=\epsilon_{init} - step\times\frac{\epsilon_{init} - \epsilon_{final}}{max\_explore\_steps}\). Defaults to 1000000.
initial_epsilon (float) – the initial epsilon value for ε-greedy explorer. Defaults to 1.0.
final_epsilon (float) – the last epsilon value for ε-greedy explorer. Defaults to 0.01.
test_epsilon (float) – the epsilon value on testing. Defaults to 0.001.
entropy_temperature (float) – temperature parameter of softmax policy distribution. Defaults to 0.03.
munchausen_scaling_term (float) – scalar of scaled log policy. Defaults to 0.9.
clipping_value (float) – Lower value of the logarithm of policy distribution. Defaults to -1.
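The three Munchausen-specific parameters above fit together as follows in the Munchausen RL paper: the reward is augmented with the scaled log-policy of a softmax policy over the q-values, clipped from below. The sketch below follows that formulation in plain Python. The function name `munchausen_bonus` is hypothetical, and whether nnabla_rl computes the term in exactly this way is an assumption.

```python
import math


def munchausen_bonus(q_values, action, entropy_temperature=0.03,
                     munchausen_scaling_term=0.9, clipping_value=-1.0):
    """Scaled, clipped log-policy term added to the reward (sketch)."""
    # log-softmax of q / temperature, computed in a numerically stable way.
    scaled = [q / entropy_temperature for q in q_values]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    log_pi = scaled[action] - log_z
    # Scale by the temperature, then clip into [clipping_value, 0].
    scaled_log_pi = entropy_temperature * log_pi
    clipped = min(max(scaled_log_pi, clipping_value), 0.0)
    return munchausen_scaling_term * clipped


greedy_bonus = munchausen_bonus([5.0, 0.0], action=0)   # near-certain action
unlikely_bonus = munchausen_bonus([5.0, 0.0], action=1)  # hits the clip
uniform_bonus = munchausen_bonus([0.0, 0.0], action=0)
```

The bonus is never positive: actions the softmax policy considers likely receive a bonus near zero, while unlikely actions are penalized by at most munchausen_scaling_term times the absolute value of clipping_value.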
- class nnabla_rl.algorithms.munchausen_dqn.MunchausenDQN(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.munchausen_dqn.MunchausenDQNConfig = MunchausenDQNConfig(gpu_id=-1, gamma=0.99, learning_rate=5e-05, batch_size=32, learner_update_frequency=4, target_update_frequency=10000, start_timesteps=50000, replay_buffer_size=1000000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.01, test_epsilon=0.001, entropy_temperature=0.03, munchausen_scaling_term=0.9, clipping_value=-1), q_func_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.munchausen_dqn.DefaultQFunctionBuilder object>, q_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.munchausen_dqn.DefaultQSolverBuilder object>, replay_buffer_builder: nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.munchausen_dqn.DefaultReplayBufferBuilder object>)[source]¶
Bases:
nnabla_rl.algorithm.Algorithm
Munchausen-DQN algorithm.
This class implements the Munchausen-DQN (Munchausen Deep Q Network) algorithm proposed by N. Vieillard, et al. in the paper: “Munchausen Reinforcement Learning” For details see: https://proceedings.neurips.cc/paper/2020/file/2c6a0bae0f071cbbf0bb3d5b11d90a82-Paper.pdf
- Parameters
env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (MunchausenDQNConfig) – configuration of MunchausenDQN algorithm
q_func_builder (ModelBuilder[QFunction]) – builder of q-function models
q_solver_builder (SolverBuilder) – builder for q-function solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
- property latest_iteration_state¶
Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for getting more details.
- Returns
Dictionary with items of training process state.
- Return type
Dict[str, Any]
Munchausen IQN¶
- class nnabla_rl.algorithms.munchausen_iqn.MunchausenIQNConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 5e-05, batch_size: int = 32, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, learner_update_frequency: int = 4, target_update_frequency: int = 10000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.01, test_epsilon: float = 0.001, N: int = 64, N_prime: int = 64, K: int = 32, kappa: float = 1.0, embedding_dim: int = 64, entropy_temperature: float = 0.03, munchausen_scaling_term: float = 0.9, clipping_value: float = - 1)[source]¶
Bases:
nnabla_rl.algorithm.AlgorithmConfig
List of configurations for Munchausen IQN algorithm
- Parameters
gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.00005.
batch_size (int) – training batch size. Defaults to 32.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 50000.
replay_buffer_size (int) – the capacity of replay buffer. Defaults to 1000000.
learner_update_frequency (int) – the interval of learner update. Defaults to 4.
target_update_frequency (int) – the interval of target q-function update. Defaults to 10000.
max_explore_steps (int) – the number of steps decaying the epsilon value. The epsilon will be decayed linearly \(\epsilon=\epsilon_{init} - step\times\frac{\epsilon_{init} - \epsilon_{final}}{max\_explore\_steps}\). Defaults to 1000000.
initial_epsilon (float) – the initial epsilon value for ε-greedy explorer. Defaults to 1.0.
final_epsilon (float) – the last epsilon value for ε-greedy explorer. Defaults to 0.01.
test_epsilon (float) – the epsilon value on testing. Defaults to 0.001.
N (int) – Number of samples to compute the current state’s quantile values. Defaults to 64.
N_prime (int) – Number of samples to compute the target state’s quantile values. Defaults to 64.
K (int) – Number of samples to compute greedy next action. Defaults to 32.
kappa (float) – threshold value of quantile huber loss. Defaults to 1.0.
embedding_dim (int) – dimension of embedding for the sample point. Defaults to 64.
entropy_temperature (float) – temperature parameter of softmax policy distribution. Defaults to 0.03.
munchausen_scaling_term (float) – scalar of scaled log policy. Defaults to 0.9.
clipping_value (float) – Lower value of the logarithm of policy distribution. Defaults to -1.
- class nnabla_rl.algorithms.munchausen_iqn.MunchausenIQN(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.munchausen_iqn.MunchausenIQNConfig = MunchausenIQNConfig(gpu_id=-1, gamma=0.99, learning_rate=5e-05, batch_size=32, start_timesteps=50000, replay_buffer_size=1000000, learner_update_frequency=4, target_update_frequency=10000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.01, test_epsilon=0.001, N=64, N_prime=64, K=32, kappa=1.0, embedding_dim=64, entropy_temperature=0.03, munchausen_scaling_term=0.9, clipping_value=-1), risk_measure_function: Callable[[nnabla._variable.Variable], nnabla._variable.Variable] = <function risk_neutral_measure>, quantile_function_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.distributional_function.StateActionQuantileFunction] = <nnabla_rl.algorithms.munchausen_iqn.DefaultQuantileFunctionBuilder object>, quantile_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.munchausen_iqn.DefaultQuantileSolverBuilder object>, replay_buffer_builder: nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.munchausen_iqn.DefaultReplayBufferBuilder object>)[source]¶
Bases:
nnabla_rl.algorithm.Algorithm
Munchausen-IQN algorithm implementation.
This class implements the Munchausen-IQN (Munchausen Implicit Quantile Network) algorithm proposed by N. Vieillard, et al. in the paper: “Munchausen Reinforcement Learning” For details see: https://proceedings.neurips.cc/paper/2020/file/2c6a0bae0f071cbbf0bb3d5b11d90a82-Paper.pdf
- Parameters
env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (MunchausenIQNConfig) – configuration of MunchausenIQN algorithm
risk_measure_function (Callable[[nn.Variable], nn.Variable]) – risk measure function to apply to the quantiles
quantile_function_builder (ModelBuilder[StateActionQuantileFunction]) – builder of state-action quantile function models
quantile_solver_builder (SolverBuilder) – builder for state-action quantile function solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
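As a rough illustration of what a risk measure function does, the sketch below distorts sampled quantile fractions in the way the IQN paper describes for CVaR. In nnabla_rl the callable receives and returns nn.Variable objects; plain Python lists are used here only to keep the example self-contained. The names `risk_neutral` and `make_cvar` are hypothetical.

```python
def risk_neutral(taus):
    """Leave the sampled quantile fractions unchanged."""
    return list(taus)


def make_cvar(eta):
    """Build a CVaR-style distortion: tau -> eta * tau, with 0 < eta <= 1.

    Smaller eta concentrates the samples on the lower (worse) part of the
    return distribution, which yields risk-averse action selection.
    """
    def cvar(taus):
        return [eta * tau for tau in taus]
    return cvar


taus = [0.1, 0.5, 0.9]
neutral_taus = risk_neutral(taus)
risk_averse_taus = make_cvar(0.25)(taus)
```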
- property latest_iteration_state¶
Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for getting more details.
- Returns
Dictionary with items of training process state.
- Return type
Dict[str, Any]
PPO¶
- class nnabla_rl.algorithms.ppo.PPOConfig(gpu_id: int = - 1, epsilon: float = 0.1, gamma: float = 0.99, learning_rate: float = 0.00025, lmb: float = 0.95, entropy_coefficient: float = 0.01, value_coefficient: float = 1.0, actor_num: int = 8, epochs: int = 3, batch_size: int = 256, actor_timesteps: int = 128, total_timesteps: int = 10000, decrease_alpha: bool = True, timelimit_as_terminal: bool = False, seed: int = 1, preprocess_state: bool = True)[source]¶
Bases:
nnabla_rl.algorithm.AlgorithmConfig
List of configurations for PPO algorithm
- Parameters
epsilon (float) – PPO’s probability ratio clipping range. Defaults to 0.1.
gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.00025.
batch_size (int) – training batch size. Defaults to 256.
lmb (float) – scalar of lambda return’s computation in GAE. Defaults to 0.95.
entropy_coefficient (float) – scalar of entropy regularization term. Defaults to 0.01.
value_coefficient (float) – scalar of value loss. Defaults to 1.0.
actor_num (int) – Number of parallel actors. Defaults to 8.
epochs (int) – Number of epochs to perform in each training iteration. Defaults to 3.
actor_timesteps (int) – Number of timesteps to interact with the environment by the actors. Defaults to 128.
total_timesteps (int) – Total number of timesteps to interact with the environment. Defaults to 10000.
decrease_alpha (bool) – Flag to control whether to decrease the learning rate linearly during the training. Defaults to True.
timelimit_as_terminal (bool) – Treat as done if the environment reaches the timelimit. Defaults to False.
seed (int) – base seed of random number generator used by the actors. Defaults to 1.
preprocess_state (bool) – Enable preprocessing the states in the collected experiences before feeding as training batch. Defaults to True.
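The epsilon parameter bounds how far the updated policy may move from the policy that collected the data. A minimal per-sample sketch of the clipped surrogate objective from the PPO paper is shown below. The function name `clipped_surrogate` is hypothetical, and the sketch works on scalars rather than on batches of nnabla variables.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.1):
    """Per-sample PPO clipped surrogate objective (to be maximized).

    ratio: pi_new(a|s) / pi_old(a|s)
    advantage: advantage estimate for the same (s, a) pair
    """
    clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    # Taking the minimum makes the objective a pessimistic bound, so there
    # is no incentive to push the ratio outside [1 - epsilon, 1 + epsilon].
    return min(ratio * advantage, clipped_ratio * advantage)


gain_capped = clipped_surrogate(1.5, 2.0)      # capped at (1 + epsilon) * 2.0
inside_range = clipped_surrogate(1.05, 1.0)    # ratio within range, unchanged
penalty_kept = clipped_surrogate(0.5, -1.0)    # the lower of the two terms
```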
- class nnabla_rl.algorithms.ppo.PPO(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.ppo.PPOConfig = PPOConfig(gpu_id=-1, epsilon=0.1, gamma=0.99, learning_rate=0.00025, lmb=0.95, entropy_coefficient=0.01, value_coefficient=1.0, actor_num=8, epochs=3, batch_size=256, actor_timesteps=128, total_timesteps=10000, decrease_alpha=True, timelimit_as_terminal=False, seed=1, preprocess_state=True), v_function_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.v_function.VFunction] = <nnabla_rl.algorithms.ppo.DefaultVFunctionBuilder object>, v_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.ppo.DefaultSolverBuilder object>, policy_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.ppo.DefaultPolicyBuilder object>, policy_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.ppo.DefaultSolverBuilder object>, state_preprocessor_builder: Optional[nnabla_rl.builders.preprocessor_builder.PreprocessorBuilder] = <nnabla_rl.algorithms.ppo.DefaultPreprocessorBuilder object>)[source]¶
Bases:
nnabla_rl.algorithm.Algorithm
Proximal Policy Optimization (PPO) algorithm implementation.
This class implements the Proximal Policy Optimization (PPO) algorithm proposed by J. Schulman, et al. in the paper: “Proximal Policy Optimization Algorithms”. For details see: https://arxiv.org/abs/1707.06347
This algorithm only supports online training.
- Parameters
env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (PPOConfig) – configuration of PPO algorithm
v_function_builder (ModelBuilder[VFunction]) – builder of v function models
v_solver_builder (SolverBuilder) – builder for v function solvers
policy_builder (ModelBuilder[StochasticPolicy]) – builder of policy models
policy_solver_builder (SolverBuilder) – builder for policy solvers
state_preprocessor_builder (None or PreprocessorBuilder) – state preprocessor builder to preprocess the states
- property latest_iteration_state¶
Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for getting more details.
- Returns
Dictionary with items of training process state.
- Return type
Dict[str, Any]
QRDQN¶
- class nnabla_rl.algorithms.qrdqn.QRDQNConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 5e-05, batch_size: int = 32, learner_update_frequency: int = 4, target_update_frequency: int = 10000, start_timesteps: int = 50000, replay_buffer_size: int = 1000000, max_explore_steps: int = 1000000, initial_epsilon: float = 1.0, final_epsilon: float = 0.01, test_epsilon: float = 0.001, num_quantiles: int = 200, kappa: float = 1.0)[source]¶
Bases:
nnabla_rl.algorithm.AlgorithmConfig
List of configurations for QRDQN algorithm
- Parameters
gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.00005.
batch_size (int) – training batch size. Defaults to 32.
learner_update_frequency (int) – the interval of learner update. Defaults to 4.
target_update_frequency (int) – the interval of target q-function update. Defaults to 10000.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 50000.
replay_buffer_size (int) – the capacity of replay buffer. Defaults to 1000000.
max_explore_steps (int) – the number of steps decaying the epsilon value. The epsilon will be decayed linearly \(\epsilon=\epsilon_{init} - step\times\frac{\epsilon_{init} - \epsilon_{final}}{max\_explore\_steps}\). Defaults to 1000000.
initial_epsilon (float) – the initial epsilon value for ε-greedy explorer. Defaults to 1.0.
final_epsilon (float) – the last epsilon value for ε-greedy explorer. Defaults to 0.01.
test_epsilon (float) – the epsilon value on testing. Defaults to 0.001.
num_quantiles (int) – Number of quantile points. Defaults to 200.
kappa (float) – threshold value of quantile huber loss. Defaults to 1.0.
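To illustrate how num_quantiles and kappa enter the loss, here is a minimal scalar sketch of the quantile Huber loss as defined in the QR-DQN paper, together with the quantile midpoints it is evaluated at. The function names are hypothetical, and the sketch is not nnabla_rl's batched implementation.

```python
def quantile_midpoints(num_quantiles=200):
    """Midpoints tau_i = (2i + 1) / (2N) of N equally sized quantile bins."""
    return [(2 * i + 1) / (2.0 * num_quantiles) for i in range(num_quantiles)]


def huber(u, kappa=1.0):
    """Huber loss: quadratic for |u| <= kappa, linear beyond it."""
    if abs(u) <= kappa:
        return 0.5 * u * u
    return kappa * (abs(u) - 0.5 * kappa)


def quantile_huber(u, tau, kappa=1.0):
    """Asymmetrically weighted Huber loss for a TD error u at quantile tau."""
    indicator = 1.0 if u < 0.0 else 0.0
    return abs(tau - indicator) * huber(u, kappa)


midpoints = quantile_midpoints(4)
under = quantile_huber(2.0, 0.25)   # positive error, weighted by tau
over = quantile_huber(-2.0, 0.25)   # negative error, weighted by 1 - tau
```

For a low quantile such as 0.25, negative errors are penalized three times as heavily as positive errors of the same size, which is what pushes each output toward its own quantile of the return distribution.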
- class nnabla_rl.algorithms.qrdqn.QRDQN(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.qrdqn.QRDQNConfig = QRDQNConfig(gpu_id=-1, gamma=0.99, learning_rate=5e-05, batch_size=32, learner_update_frequency=4, target_update_frequency=10000, start_timesteps=50000, replay_buffer_size=1000000, max_explore_steps=1000000, initial_epsilon=1.0, final_epsilon=0.01, test_epsilon=0.001, num_quantiles=200, kappa=1.0), quantile_dist_function_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.distributional_function.QuantileDistributionFunction] = <nnabla_rl.algorithms.qrdqn.DefaultQuantileBuilder object>, quantile_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.qrdqn.DefaultSolverBuilder object>, replay_buffer_builder: nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.qrdqn.DefaultReplayBufferBuilder object>)[source]¶
Bases:
nnabla_rl.algorithm.Algorithm
Quantile Regression DQN algorithm.
This class implements the Quantile Regression DQN algorithm proposed by W. Dabney, et al. in the paper: “Distributional Reinforcement Learning with Quantile Regression” For details see: https://arxiv.org/abs/1710.10044
- Parameters
env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (QRDQNConfig) – configuration of QRDQN algorithm
quantile_dist_function_builder (ModelBuilder[QuantileDistributionFunction]) – builder of quantile distribution function models
quantile_solver_builder (SolverBuilder) – builder for quantile distribution function solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
- property latest_iteration_state¶
Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for getting more details.
- Returns
Dictionary with items of training process state.
- Return type
Dict[str, Any]
REINFORCE¶
- class nnabla_rl.algorithms.reinforce.REINFORCEConfig(gpu_id: int = - 1, reward_scale: float = 0.01, num_rollouts_per_train_iteration: int = 10, learning_rate: float = 0.001, clip_grad_norm: float = 1.0, fixed_ln_var: float = - 2.3025850929940455)[source]¶
Bases:
nnabla_rl.algorithm.AlgorithmConfig
REINFORCE config
- Parameters
reward_scale (float) – Scale of reward. Defaults to 0.01.
num_rollouts_per_train_iteration (int) – Number of rollouts per training iteration for collecting on-policy experiences. Increasing this value makes the policy function updates more precise, but the computation time of each iteration will increase. Defaults to 10.
learning_rate (float) – Learning rate which is set to the solvers of the policy function. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.001.
clip_grad_norm (float) – Clip the norm of the gradient to this value. Defaults to 1.0.
fixed_ln_var (float) – Fixed log variance of the policy. This configuration is only valid when the environment is continuous. Defaults to -2.3025850929940455 (the natural logarithm of 0.1).
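The default of fixed_ln_var in the REINFORCEConfig signature is easier to read once converted back from log space. The short sketch below only performs that conversion; the variable names are hypothetical.

```python
import math

fixed_ln_var = -2.3025850929940455  # default taken from the signature

# exp() undoes the logarithm, giving the variance of the policy.
variance = math.exp(fixed_ln_var)
# The standard deviation is the square root of the variance.
std = math.sqrt(variance)
```

The default therefore corresponds to a variance of about 0.1, that is, a standard deviation of roughly 0.316.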
- class nnabla_rl.algorithms.reinforce.REINFORCE(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.reinforce.REINFORCEConfig = REINFORCEConfig(gpu_id=-1, reward_scale=0.01, num_rollouts_per_train_iteration=10, learning_rate=0.001, clip_grad_norm=1.0, fixed_ln_var=-2.3025850929940455), policy_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.reinforce.DefaultPolicyBuilder object>, policy_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.reinforce.DefaultSolverBuilder object>)[source]¶
Bases:
nnabla_rl.algorithm.Algorithm
Episodic REINFORCE implementation.
This class implements the episodic REINFORCE algorithm proposed by Ronald J. Williams in the paper: “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning”. For details see: https://link.springer.com/content/pdf/10.1007/BF00992696.pdf
This algorithm only supports online training.
- Parameters
env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (REINFORCEConfig) – configuration of REINFORCE algorithm
policy_builder (ModelBuilder[StochasticPolicy]) – builder of policy models
policy_solver_builder (SolverBuilder) – builder for policy function solvers
- property latest_iteration_state¶
Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for getting more details.
- Returns
Dictionary with items of training process state.
- Return type
Dict[str, Any]
SAC¶
- class nnabla_rl.algorithms.sac.SACConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 0.00030000000000000003, batch_size: int = 256, tau: float = 0.005, environment_steps: int = 1, gradient_steps: int = 1, target_entropy: Optional[float] = None, initial_temperature: Optional[float] = None, fix_temperature: bool = False, start_timesteps: int = 10000, replay_buffer_size: int = 1000000)[source]¶
Bases:
nnabla_rl.algorithm.AlgorithmConfig
List of configurations for SAC algorithm
- Parameters
gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the (SolverBuilder) by yourself. Defaults to 0.0003.
batch_size (int) – training batch size. Defaults to 256.
tau (float) – target network’s parameter update coefficient. Defaults to 0.005.
environment_steps (int) – Number of steps to interact with the environment on each iteration. Defaults to 1.
gradient_steps (int) – Number of parameter updates to perform on each iteration. Defaults to 1.
target_entropy (float, optional) – Target entropy value. Defaults to None.
initial_temperature (float, optional) – Initial value of temperature parameter. Defaults to None.
fix_temperature (bool) – If true the temperature parameter will not be trained. Defaults to False.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 10000.
replay_buffer_size (int) – capacity of the replay buffer. Defaults to 1000000.
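The tau parameter is the coefficient of the target network update. Assuming the usual soft (Polyak averaging) update used by SAC, the update can be sketched on plain lists of floats as below. The function name `soft_update` is hypothetical, and nnabla_rl applies the equivalent operation to network parameters rather than to lists.

```python
def soft_update(target_params, source_params, tau=0.005):
    """Move each target parameter a fraction tau toward the trained one."""
    return [(1.0 - tau) * t + tau * s
            for t, s in zip(target_params, source_params)]


target = [0.0, 1.0]
source = [1.0, 1.0]
halfway = soft_update(target, source, tau=0.5)
hard_copy = soft_update(target, source, tau=1.0)
slow = soft_update(target, source)  # default tau: small step per update
```

A small tau such as the default makes the target network change slowly, while tau set to 1.0 degenerates into copying the trained parameters.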
- class nnabla_rl.algorithms.sac.SAC(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.sac.SACConfig = SACConfig(gpu_id=-1, gamma=0.99, learning_rate=0.00030000000000000003, batch_size=256, tau=0.005, environment_steps=1, gradient_steps=1, target_entropy=None, initial_temperature=None, fix_temperature=False, start_timesteps=10000, replay_buffer_size=1000000), q_function_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.sac.DefaultQFunctionBuilder object>, q_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.sac.DefaultSolverBuilder object>, policy_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.sac.DefaultPolicyBuilder object>, policy_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.sac.DefaultSolverBuilder object>, temperature_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.sac.DefaultSolverBuilder object>, replay_buffer_builder: nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.sac.DefaultReplayBufferBuilder object>)[source]¶
Bases:
nnabla_rl.algorithm.Algorithm
Soft Actor-Critic (SAC) algorithm implementation.
This class implements the extended version of Soft Actor Critic (SAC) algorithm proposed by T. Haarnoja, et al. in the paper: “Soft Actor-Critic Algorithms and Applications” For detail see: https://arxiv.org/abs/1812.05905
This algorithm slightly differs from the implementation of the Soft Actor-Critic algorithm also presented by T. Haarnoja, et al. in the following paper: https://arxiv.org/abs/1801.01290
The temperature parameter is adjusted automatically instead of requiring a reward scalar as a hyperparameter.
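The automatic temperature adjustment can be illustrated with a minimal sketch. It assumes the objective J(alpha) = E[-alpha * (log pi(a|s) + target_entropy)] from the paper cited above and a plain gradient-descent step on log(alpha). The function name, the learning rate and the list-based inputs are hypothetical and are not taken from the library's implementation.

```python
import math


def update_log_temperature(log_alpha, log_probs, target_entropy, learning_rate=3e-4):
    """One gradient-descent step on J(alpha) = mean(-alpha * (log_pi + target_entropy)).

    Hypothetical helper for illustration only; not part of nnabla_rl.
    """
    alpha = math.exp(log_alpha)
    mean_term = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    # Chain rule: dJ/d(log_alpha) = -alpha * mean(log_pi + target_entropy)
    grad = -alpha * mean_term
    return log_alpha - learning_rate * grad
```

When the policy's entropy estimate (the negated mean of log_probs) is below target_entropy, the step raises the temperature, which weights the entropy term more strongly; otherwise it lowers it.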
- Parameters
env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (SACConfig) – configuration of the SAC algorithm
q_function_builder (ModelBuilder[QFunction]) – builder of q function models
q_solver_builder (SolverBuilder) – builder of q function solvers
policy_builder (ModelBuilder[StochasticPolicy]) – builder of actor models
policy_solver_builder (SolverBuilder) – builder of policy solvers
temperature_solver_builder (SolverBuilder) – builder of temperature solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
- property latest_iteration_state¶
Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for getting more details.
- Returns
Dictionary with items of training process state.
- Return type
Dict[str, Any]
SAC (ICML 2018 version)¶
- class nnabla_rl.algorithms.icml2018_sac.ICML2018SACConfig(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 0.00030000000000000003, batch_size: int = 256, tau: float = 0.005, environment_steps: int = 1, gradient_steps: int = 1, reward_scalar: float = 5.0, start_timesteps: int = 10000, replay_buffer_size: int = 1000000, target_update_interval: int = 1)[source]¶
Bases:
nnabla_rl.algorithm.AlgorithmConfig
List of configurations for ICML2018SAC algorithm.
- Parameters
gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the SolverBuilder yourself. Defaults to 0.0003.
batch_size (int) – training batch size. Defaults to 256.
tau (float) – target network’s parameter update coefficient. Defaults to 0.005.
environment_steps (int) – Number of steps to interact with the environment on each iteration. Defaults to 1.
gradient_steps (int) – Number of parameter updates to perform on each iteration. Defaults to 1.
reward_scalar (float) – Reward scaling factor. Obtained reward will be multiplied by this value. Defaults to 5.0.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 10000.
replay_buffer_size (int) – capacity of the replay buffer. Defaults to 1000000.
target_update_interval (int) – the interval of target v function parameter’s update. Defaults to 1.
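The roles of tau and reward_scalar can be sketched in a few lines. This is an illustration under the assumption that the target network is updated by the usual Polyak averaging rule (target = (1 - tau) * target + tau * trained); the helper names and the flat list of parameters are hypothetical.

```python
def soft_update(target_params, source_params, tau=0.005):
    """Move each target parameter a fraction tau toward the trained parameter."""
    return [(1.0 - tau) * t + tau * s for t, s in zip(target_params, source_params)]


def scale_reward(reward, reward_scalar=5.0):
    """Multiply the reward obtained from the environment by reward_scalar."""
    return reward * reward_scalar
```

With the default tau of 0.005, the target parameters follow the trained parameters slowly, which keeps the regression target of the critic stable.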
- class nnabla_rl.algorithms.icml2018_sac.ICML2018SAC(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.icml2018_sac.ICML2018SACConfig = ICML2018SACConfig(gpu_id=-1, gamma=0.99, learning_rate=0.00030000000000000003, batch_size=256, tau=0.005, environment_steps=1, gradient_steps=1, reward_scalar=5.0, start_timesteps=10000, replay_buffer_size=1000000, target_update_interval=1), v_function_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.v_function.VFunction] = <nnabla_rl.algorithms.icml2018_sac.DefaultVFunctionBuilder object>, v_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.icml2018_sac.DefaultSolverBuilder object>, q_function_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.icml2018_sac.DefaultQFunctionBuilder object>, q_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.icml2018_sac.DefaultSolverBuilder object>, policy_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.icml2018_sac.DefaultPolicyBuilder object>, policy_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.icml2018_sac.DefaultSolverBuilder object>, replay_buffer_builder: nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.icml2018_sac.DefaultReplayBufferBuilder object>)[source]¶
Bases:
nnabla_rl.algorithm.Algorithm
Soft Actor-Critic (SAC) algorithm.
This class implements the ICML2018 version of Soft Actor Critic (SAC) algorithm proposed by T. Haarnoja, et al. in the paper: “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor” For detail see: https://arxiv.org/abs/1801.01290
This implementation slightly differs from the implementation of Soft Actor-Critic algorithm presented also by T. Haarnoja, et al. in the following paper: https://arxiv.org/abs/1812.05905
You will need to scale the reward received from the environment properly to get the algorithm to work.
- Parameters
env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (ICML2018SACConfig) – configuration of the ICML2018SAC algorithm
v_function_builder (ModelBuilder[VFunction]) – builder of v function models
v_solver_builder (SolverBuilder) – builder of v function solvers
q_function_builder (ModelBuilder[QFunction]) – builder of q function models
q_solver_builder (SolverBuilder) – builder of q function solvers
policy_builder (ModelBuilder[StochasticPolicy]) – builder of actor models
policy_solver_builder (SolverBuilder) – builder of policy solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
- property latest_iteration_state¶
Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for getting more details.
- Returns
Dictionary with items of training process state.
- Return type
Dict[str, Any]
TD3¶
- class nnabla_rl.algorithms.td3.TD3Config(gpu_id: int = - 1, gamma: float = 0.99, learning_rate: float = 0.001, batch_size: int = 100, tau: float = 0.005, start_timesteps: int = 10000, replay_buffer_size: int = 1000000, d: int = 2, exploration_noise_sigma: float = 0.1, train_action_noise_sigma: float = 0.2, train_action_noise_abs: float = 0.5)[source]¶
Bases:
nnabla_rl.algorithm.AlgorithmConfig
List of configurations for TD3 algorithm
- Parameters
gamma (float) – discount factor of rewards. Defaults to 0.99.
learning_rate (float) – learning rate which is set to all solvers. You can customize/override the learning rate for each solver by implementing the SolverBuilder yourself. Defaults to 0.001.
batch_size (int) – training batch size. Defaults to 100.
tau (float) – target network’s parameter update coefficient. Defaults to 0.005.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 10000.
replay_buffer_size (int) – capacity of the replay buffer. Defaults to 1000000.
d (int) – Interval of the policy update. The policy will be updated every d q-function updates. Defaults to 2.
exploration_noise_sigma (float) – Standard deviation of the gaussian exploration noise. Defaults to 0.1.
train_action_noise_sigma (float) – Standard deviation of the gaussian action noise used in the training. Defaults to 0.2.
train_action_noise_abs (float) – Absolute limit value of action noise used in the training. Defaults to 0.5.
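The parameters d, train_action_noise_sigma and train_action_noise_abs can be read as follows. This is a minimal sketch that assumes the target policy smoothing and delayed policy update described in the TD3 paper; the function names and the action bounds low and high are hypothetical, and the code is not the library's implementation.

```python
import random


def smoothed_target_action(action, sigma=0.2, noise_abs=0.5, low=-1.0, high=1.0, rng=random):
    """Add clipped gaussian noise to a scalar target action, then clip to the action range."""
    noise = rng.gauss(0.0, sigma)
    # train_action_noise_abs limits the magnitude of the noise
    noise = max(-noise_abs, min(noise_abs, noise))
    # low and high are assumed action bounds (hypothetical)
    return max(low, min(high, action + noise))


def should_update_policy(q_update_count, d=2):
    """The policy is updated once every d q-function updates."""
    return q_update_count % d == 0
```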
- class nnabla_rl.algorithms.td3.TD3(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.td3.TD3Config = TD3Config(gpu_id=-1, gamma=0.99, learning_rate=0.001, batch_size=100, tau=0.005, start_timesteps=10000, replay_buffer_size=1000000, d=2, exploration_noise_sigma=0.1, train_action_noise_sigma=0.2, train_action_noise_abs=0.5), critic_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.q_function.QFunction] = <nnabla_rl.algorithms.td3.DefaultCriticBuilder object>, critic_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.td3.DefaultSolverBuilder object>, actor_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.policy.DeterministicPolicy] = <nnabla_rl.algorithms.td3.DefaultActorBuilder object>, actor_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.td3.DefaultSolverBuilder object>, replay_buffer_builder: nnabla_rl.builders.replay_buffer_builder.ReplayBufferBuilder = <nnabla_rl.algorithms.td3.DefaultReplayBufferBuilder object>)[source]¶
Bases:
nnabla_rl.algorithm.Algorithm
Twin Delayed Deep Deterministic policy gradient (TD3) algorithm.
This class implements the Twin Delayed Deep Deterministic policy gradient (TD3) algorithm proposed by S. Fujimoto, et al. in the paper: “Addressing Function Approximation Error in Actor-Critic Methods” For detail see: https://arxiv.org/abs/1802.09477
- Parameters
env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (TD3Config) – configuration of the TD3 algorithm
critic_builder (ModelBuilder[QFunction]) – builder of critic models
critic_solver_builder (SolverBuilder) – builder of critic solvers
actor_builder (ModelBuilder[DeterministicPolicy]) – builder of actor models
actor_solver_builder (SolverBuilder) – builder of actor solvers
replay_buffer_builder (ReplayBufferBuilder) – builder of replay_buffer
- property latest_iteration_state¶
Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for getting more details.
- Returns
Dictionary with items of training process state.
- Return type
Dict[str, Any]
TRPO¶
- class nnabla_rl.algorithms.trpo.TRPOConfig(gpu_id: int = - 1, gamma: float = 0.995, lmb: float = 0.97, num_steps_per_iteration: int = 5000, pi_batch_size: int = 5000, sigma_kl_divergence_constraint: float = 0.01, maximum_backtrack_numbers: int = 10, conjugate_gradient_damping: float = 0.1, conjugate_gradient_iterations: int = 20, vf_epochs: int = 5, vf_batch_size: int = 64, vf_learning_rate: float = 0.001, preprocess_state: bool = True, gpu_batch_size: Optional[int] = None)[source]¶
Bases:
nnabla_rl.algorithm.AlgorithmConfig
TRPO config
- Parameters
gamma (float) – Discount factor of rewards. Defaults to 0.995.
lmb (float) – Scalar of lambda return’s computation in GAE. Defaults to 0.97. This configuration is related to the bias and variance of the estimated value. If it is close to 0, the estimated value is low-variance but biased. If it is close to 1, the estimated value is unbiased but high-variance.
num_steps_per_iteration (int) – Number of steps per training iteration for collecting on-policy experiences. Increasing the number of steps gives more precise policy and value function updates, but the computational time of each iteration will increase. Defaults to 5000.
pi_batch_size (int) – Training batch size of policy. Usually, pi_batch_size is the same as num_steps_per_iteration. Defaults to 5000.
sigma_kl_divergence_constraint (float) – Constraint size of kl divergence between previous policy and updated policy. Defaults to 0.01.
maximum_backtrack_numbers (int) – Maximum backtrack numbers of linesearch. Defaults to 10.
conjugate_gradient_damping (float) – Damping size of conjugate gradient method. Defaults to 0.1.
conjugate_gradient_iterations (int) – Number of iterations of conjugate gradient method. Defaults to 20.
vf_epochs (int) – Number of epochs in each iteration. Defaults to 5.
vf_batch_size (int) – Training batch size of value function. Defaults to 64.
vf_learning_rate (float) – Learning rate which is set to the solvers of value function. You can customize/override the learning rate for each solver by implementing the SolverBuilder yourself. Defaults to 0.001.
preprocess_state (bool) – Enable preprocessing the states in the collected experiences before feeding as training batch. Defaults to True.
gpu_batch_size (int, optional) – Actual batch size to reduce one forward gpu calculation memory. As long as gpu memory size is enough, this configuration should not be specified. If not specified, gpu_batch_size is the same as pi_batch_size. Defaults to None.
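The effect of gamma and lmb on the advantage estimate can be seen in a small sketch of Generalized Advantage Estimation. It assumes a single trajectory without episode termination inside it and a values list with one extra bootstrap entry; the function name is hypothetical and the code is not the library's implementation.

```python
def compute_gae(rewards, values, gamma=0.995, lmb=0.97):
    """Compute GAE advantages for one trajectory.

    values must contain len(rewards) + 1 entries; the last one is the
    bootstrap value of the state reached after the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal difference error
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors
        gae = delta + gamma * lmb * gae
        advantages[t] = gae
    return advantages
```

With lmb set to 0 each advantage reduces to the one-step TD error (low variance, biased); with lmb set to 1 it becomes the discounted sum of all future TD errors (unbiased, high variance).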
- class nnabla_rl.algorithms.trpo.TRPO(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.trpo.TRPOConfig = TRPOConfig(gpu_id=-1, gamma=0.995, lmb=0.97, num_steps_per_iteration=5000, pi_batch_size=5000, sigma_kl_divergence_constraint=0.01, maximum_backtrack_numbers=10, conjugate_gradient_damping=0.1, conjugate_gradient_iterations=20, vf_epochs=5, vf_batch_size=64, vf_learning_rate=0.001, preprocess_state=True, gpu_batch_size=None), v_function_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.v_function.VFunction] = <nnabla_rl.algorithms.trpo.DefaultVFunctionBuilder object>, v_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.trpo.DefaultSolverBuilder object>, policy_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.trpo.DefaultPolicyBuilder object>, state_preprocessor_builder: Optional[nnabla_rl.builders.preprocessor_builder.PreprocessorBuilder] = <nnabla_rl.algorithms.trpo.DefaultPreprocessorBuilder object>)[source]¶
Bases:
nnabla_rl.algorithm.Algorithm
Trust Region Policy Optimization method with Generalized Advantage Estimation (GAE) implementation.
This class implements the Trust Region Policy Optimization (TRPO) with Generalized Advantage Estimation (GAE) algorithm proposed by J. Schulman, et al. in the papers: “Trust Region Policy Optimization” and “High-Dimensional Continuous Control Using Generalized Advantage Estimation” For detail see: https://arxiv.org/abs/1502.05477 and https://arxiv.org/abs/1506.02438
This algorithm only supports online training.
- Parameters
env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (TRPOConfig) – configuration of TRPO algorithm
v_function_builder (ModelBuilder[VFunction]) – builder of v function models
v_solver_builder (SolverBuilder) – builder for v function solvers
policy_builder (ModelBuilder[StochasticPolicy]) – builder of policy models
state_preprocessor_builder (None or PreprocessorBuilder) – state preprocessor builder to preprocess the states
- property latest_iteration_state¶
Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for getting more details.
- Returns
Dictionary with items of training process state.
- Return type
Dict[str, Any]
TRPO (ICML 2015 version)¶
- class nnabla_rl.algorithms.icml2015_trpo.ICML2015TRPOConfig(gpu_id: int = - 1, gamma: float = 0.99, num_steps_per_iteration: int = 100000, batch_size: int = 100000, gpu_batch_size: Optional[int] = None, sigma_kl_divergence_constraint: float = 0.01, maximum_backtrack_numbers: int = 10, conjugate_gradient_damping: float = 0.001, conjugate_gradient_iterations: int = 10)[source]¶
Bases:
nnabla_rl.algorithm.AlgorithmConfig
ICML2015TRPO config
- Parameters
gamma (float) – Discount factor of rewards. Defaults to 0.99.
num_steps_per_iteration (int) – Number of steps per training iteration for collecting on-policy experiences. Increasing the number of steps gives more precise policy updates, but the computational time of each iteration will increase. Defaults to 100000.
batch_size (int) – Training batch size of policy. Usually, batch_size is the same as num_steps_per_iteration. Defaults to 100000.
gpu_batch_size (int, optional) – Actual batch size to reduce one forward gpu calculation memory. As long as gpu memory size is enough, this configuration should not be specified. If not specified, gpu_batch_size is the same as batch_size. Defaults to None.
sigma_kl_divergence_constraint (float) – Constraint size of kl divergence between previous policy and updated policy. Defaults to 0.01.
maximum_backtrack_numbers (int) – Maximum backtrack numbers of linesearch. Defaults to 10.
conjugate_gradient_damping (float) – Damping size of conjugate gradient method. Defaults to 0.001.
conjugate_gradient_iterations (int) – Number of iterations of conjugate gradient method. Defaults to 10.
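How conjugate_gradient_damping and conjugate_gradient_iterations interact can be illustrated with a generic conjugate gradient solver. This is a sketch under the assumption that the damping is added to the diagonal of the matrix, so the system solved is (A + damping * I) x = b; in TRPO, A would be the Fisher information matrix, accessed only through matrix-vector products. The function name, the tol argument and the list-based vectors are hypothetical.

```python
def conjugate_gradient(matvec, b, damping=0.001, iterations=10, tol=1e-10):
    """Approximately solve (A + damping * I) x = b using only products A @ v.

    matvec(v) must return A @ v as a list of floats.
    """
    x = [0.0] * len(b)
    r = list(b)  # residual b - (A + damping * I) x for x = 0
    p = list(b)  # current search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(iterations):
        if rs_old < tol:
            break  # residual is already small enough
        Ap = [api + damping * pi for api, pi in zip(matvec(p), p)]
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```

A larger damping value makes the system better conditioned at the cost of a less exact solution, and the iteration count bounds the work spent per policy update.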
- class nnabla_rl.algorithms.icml2015_trpo.ICML2015TRPO(env_or_env_info: Union[gym.core.Env, nnabla_rl.environments.environment_info.EnvironmentInfo], config: nnabla_rl.algorithms.icml2015_trpo.ICML2015TRPOConfig = ICML2015TRPOConfig(gpu_id=-1, gamma=0.99, num_steps_per_iteration=100000, batch_size=100000, gpu_batch_size=None, sigma_kl_divergence_constraint=0.01, maximum_backtrack_numbers=10, conjugate_gradient_damping=0.001, conjugate_gradient_iterations=10), policy_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.icml2015_trpo.DefaultPolicyBuilder object>)[source]¶
Bases:
nnabla_rl.algorithm.Algorithm
Trust Region Policy Optimization method with Single Path algorithm.
This class implements the Trust Region Policy Optimization (TRPO) with Single Path algorithm proposed by J. Schulman, et al. in the paper: “Trust Region Policy Optimization” For detail see: https://arxiv.org/abs/1502.05477
- Parameters
env_or_env_info (gym.Env or EnvironmentInfo) – the environment to train or environment info
config (ICML2015TRPOConfig) – configuration of ICML2015TRPO algorithm
policy_builder (ModelBuilder[StochasticPolicy]) – builder of policy models
- property latest_iteration_state¶
Return latest iteration state that is composed of items of training process state. You can use this state for debugging (e.g. plot loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for getting more details.
- Returns
Dictionary with items of training process state.
- Return type
Dict[str, Any]
Builders¶
Builder Class
ModelBuilder¶
- class nnabla_rl.builders.ModelBuilder(*args, **kwds)[source]¶
Model builder interface class
- build_model(scope_name: str, env_info: nnabla_rl.environments.environment_info.EnvironmentInfo, algorithm_config: nnabla_rl.algorithm.AlgorithmConfig, **kwargs) → nnabla_rl.builders.model_builder.T[source]¶
Build model.
- Parameters
scope_name (str) – the scope name of model
env_info (EnvironmentInfo) – environment information
algorithm_config (AlgorithmConfig) – configuration class of target algorithm. Actual type differs depending on the algorithm.
- Returns
model instance. The type of the model depends on the builder’s generic type.
- Return type
T
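The shape of a custom builder can be sketched without the library installed. Everything below is a hypothetical stand-in: FakeEnvInfo, FakeConfig and FakeModel only imitate EnvironmentInfo, an AlgorithmConfig subclass and a model, and the builder is duck-typed rather than derived from nnabla_rl.builders.ModelBuilder. Only the build_model argument list follows the signature documented above.

```python
from dataclasses import dataclass


@dataclass
class FakeEnvInfo:
    """Hypothetical stand-in for EnvironmentInfo."""
    state_dim: int
    action_dim: int


@dataclass
class FakeConfig:
    """Hypothetical stand-in for an AlgorithmConfig subclass."""
    hidden_units: int = 64


@dataclass
class FakeModel:
    """Hypothetical stand-in for a model; it only records its layer sizes."""
    scope_name: str
    layer_sizes: tuple


class TwoLayerModelBuilder:
    """Duck-typed builder exposing the documented build_model arguments."""

    def build_model(self, scope_name, env_info, algorithm_config, **kwargs):
        # The builder decides the architecture from the environment and the config
        sizes = (env_info.state_dim, algorithm_config.hidden_units, env_info.action_dim)
        return FakeModel(scope_name, sizes)
```

In real code the class would subclass ModelBuilder with the appropriate generic type and return an actual model, and an instance would be passed to the algorithm through an argument such as q_function_builder or policy_builder.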
PreprocessorBuilder¶
- class nnabla_rl.builders.PreprocessorBuilder[source]¶
Preprocessor builder interface class
- build_preprocessor(scope_name: str, env_info: nnabla_rl.environments.environment_info.EnvironmentInfo, algorithm_config: nnabla_rl.algorithm.AlgorithmConfig, **kwargs) → nnabla_rl.preprocessors.preprocessor.Preprocessor[source]¶
Build preprocessor
- Parameters
scope_name (str) – the scope name of model
env_info (EnvironmentInfo) – environment information
algorithm_config (AlgorithmConfig) – configuration class of target algorithm. Actual type differs depending on the algorithm.
- Returns
preprocessor instance.
- Return type
Preprocessor
ReplayBufferBuilder¶
- class nnabla_rl.builders.ReplayBufferBuilder[source]¶
ReplayBuffer builder interface class
- build_replay_buffer(env_info: nnabla_rl.environments.environment_info.EnvironmentInfo, algorithm_config: nnabla_rl.algorithm.AlgorithmConfig, **kwargs) → nnabla_rl.replay_buffer.ReplayBuffer[source]¶
Build replay buffer
- Parameters
env_info (EnvironmentInfo) – environment information
algorithm_config (AlgorithmConfig) – configuration class of the algorithm
- Returns
replay buffer instance.
- Return type
ReplayBuffer
SolverBuilder¶
- class nnabla_rl.builders.SolverBuilder[source]¶
Solver builder interface class
- build_solver(env_info: nnabla_rl.environments.environment_info.EnvironmentInfo, algorithm_config: nnabla_rl.algorithm.AlgorithmConfig, **kwargs) → nnabla.solver.Solver[source]¶
Build solver function
- Parameters
env_info (EnvironmentInfo) – environment information
algorithm_config (AlgorithmConfig) – configuration class of the target algorithm
- Returns
solver instance.
- Return type
Solver
Distributions¶
All probability distributions are derived from nnabla_rl.distributions.Distribution
Distribution¶
- class nnabla_rl.distributions.Distribution[source]¶
- choose_probable() → nnabla._variable.Variable[source]¶
Compute the most probable action of the distribution
- Returns
Probable action of the distribution
- Return type
nnabla.Variable
- entropy() → nnabla._variable.Variable[source]¶
Compute the entropy of the distribution
- Returns
Entropy of the distribution
- Return type
nn.Variable
- kl_divergence(q: nnabla_rl.distributions.distribution.Distribution) → nnabla._variable.Variable[source]¶
Compute the Kullback-Leibler divergence between this distribution and the given distribution. This function will compute KL(self||q).
- Parameters
q (nnabla_rl.distributions.Distribution) – target distribution to compute the kl_divergence
- Returns
Kullback leibler divergence
- Return type
nn.Variable
- Raises
ValueError – target distribution’s type does not match with current distribution type.
- log_prob(x: nnabla._variable.Variable) → nnabla._variable.Variable[source]¶
Compute the log probability of given input
- Parameters
x (nn.Variable) – Target value to compute the log probability
- Returns
Log probability of given input
- Return type
nn.Variable
- mean() → nnabla._variable.Variable[source]¶
Compute the mean of the distribution (if it exists)
- Returns
mean of the distribution
- Return type
nn.Variable
- Raises
NotImplementedError – The distribution does not have mean
- property ndim: int¶
The number of dimensions of the distribution
- abstract sample(noise_clip: Optional[Tuple[float, float]] = None) → nnabla._variable.Variable[source]¶
Sample a value from the distribution. If noise_clip is specified, the sampled value will be clipped in the given range. Applicability of noise_clip depends on underlying implementation.
- Parameters
noise_clip (Tuple[float, float], optional) – float tuple of size 2 which contains the min and max value of the noise.
- Returns
Sampled value
- Return type
nn.Variable
- sample_and_compute_log_prob(noise_clip: Optional[Tuple[float, float]] = None) → Tuple[nnabla._variable.Variable, nnabla._variable.Variable][source]¶
Sample a value from the distribution and compute its log probability.
- Parameters
noise_clip (Tuple[float, float], optional) – float tuple of size 2 which contains the min and max value of the noise.
- Returns
Sampled value and its log probability
- Return type
Tuple[nn.Variable, nn.Variable]
- sample_multiple(num_samples: int, noise_clip: Optional[Tuple[float, float]] = None) → nnabla._variable.Variable[source]¶
Sample multiple values from the distribution. A new axis will be added between the first and second axis. Therefore, for mean and variance with shape (batch_size, data_shape), the returned value will have shape (batch_size, num_samples, data_shape).
If noise_clip is specified, sampled values will be clipped in the given range. Applicability of noise_clip depends on underlying implementation.
- Parameters
num_samples (int) – number of samples per batch
noise_clip (Tuple[float, float], optional) – float tuple of size 2 which contains the min and max value of the noise.
- Returns
Sampled value.
- Return type
nn.Variable
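The shape convention of sample_multiple can be illustrated with nested lists. This sketch assumes a diagonal gaussian parameterized by mean and log-variance and ignores noise_clip; the standalone function and its list-based inputs are hypothetical and only mimic the documented output shape.

```python
import math
import random


def sample_multiple(means, ln_vars, num_samples, rng=random):
    """Draw num_samples values per batch element.

    means and ln_vars are nested lists of shape (batch_size, data_dim).
    The result has shape (batch_size, num_samples, data_dim).
    """
    out = []
    for mean_row, ln_var_row in zip(means, ln_vars):
        samples = [
            # Standard deviation is exp(0.5 * ln_var)
            [rng.gauss(m, math.exp(0.5 * lv)) for m, lv in zip(mean_row, ln_var_row)]
            for _ in range(num_samples)
        ]
        out.append(samples)
    return out
```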
List of Distributions¶
- class nnabla_rl.distributions.Gaussian(mean, ln_var)[source]¶
Bases:
nnabla_rl.distributions.distribution.Distribution
Gaussian distribution
\(\mathcal{N}(\mu,\,\sigma^{2})\)
- Parameters
mean (nn.Variable) – mean \(\mu\) of gaussian distribution.
ln_var (nn.Variable) – logarithm of the variance \(\sigma^{2}\). (i.e. ln_var is \(\log{\sigma^{2}}\))
- choose_probable()[source]¶
Compute the most probable action of the distribution
- Returns
Probable action of the distribution
- Return type
nnabla.Variable
- entropy()[source]¶
Compute the entropy of the distribution
- Returns
Entropy of the distribution
- Return type
nn.Variable
- kl_divergence(q)[source]¶
Compute the Kullback-Leibler divergence between this distribution and the given distribution. This function will compute KL(self||q).
- Parameters
q (nnabla_rl.distributions.Distribution) – target distribution to compute the kl_divergence
- Returns
Kullback leibler divergence
- Return type
nn.Variable
- Raises
ValueError – target distribution’s type does not match with current distribution type.
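For two diagonal gaussians, KL(self||q) has a closed form that can be written directly in terms of mean and ln_var. The following is a sketch of that standard formula on plain Python lists, summed over dimensions; the function name is hypothetical and the library's own implementation may differ in reduction axes and numerics.

```python
import math


def gaussian_kl(mean_p, ln_var_p, mean_q, ln_var_q):
    """KL(p||q) between diagonal gaussians given means and log-variances."""
    total = 0.0
    for mp, lp, mq, lq in zip(mean_p, ln_var_p, mean_q, ln_var_q):
        var_p, var_q = math.exp(lp), math.exp(lq)
        # 0.5 * (log(var_q / var_p) + (var_p + (mu_p - mu_q)^2) / var_q - 1)
        total += 0.5 * (lq - lp + (var_p + (mp - mq) ** 2) / var_q - 1.0)
    return total
```

The divergence is zero for identical distributions and is not symmetric in p and q, which is why the order KL(self||q) matters.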
- log_prob(x)[source]¶
Compute the log probability of given input
- Parameters
x (nn.Variable) – Target value to compute the log probability
- Returns
Log probability of given input
- Return type
nn.Variable
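Because the distribution is parameterized by the log-variance, the log probability can be computed without taking a logarithm of the variance. A minimal sketch of the diagonal gaussian log density on plain lists, summed over dimensions, is shown below; the function name is hypothetical.

```python
import math


def gaussian_log_prob(x, mean, ln_var):
    """Log density of a diagonal gaussian, summed over dimensions."""
    total = 0.0
    for xi, mi, lv in zip(x, mean, ln_var):
        # -0.5 * (log(2 * pi) + log(sigma^2) + (x - mu)^2 / sigma^2)
        total += -0.5 * (math.log(2.0 * math.pi) + lv + (xi - mi) ** 2 / math.exp(lv))
    return total
```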
- mean()[source]¶
Compute the mean of the distribution (if it exists)
- Returns
mean of the distribution
- Return type
nn.Variable
- Raises
NotImplementedError – The distribution does not have mean
- property ndim¶
The number of dimensions of the distribution
- sample(noise_clip=None)[source]¶
Sample a value from the distribution. If noise_clip is specified, the sampled value will be clipped in the given range. Applicability of noise_clip depends on underlying implementation.
- Parameters
noise_clip (Tuple[float, float], optional) – float tuple of size 2 which contains the min and max value of the noise.
- Returns
Sampled value
- Return type
nn.Variable
- sample_and_compute_log_prob(noise_clip=None)[source]¶
Sample a value from the distribution and compute its log probability.
- Parameters
noise_clip (Tuple[float, float], optional) – float tuple of size 2 which contains the min and max value of the noise.
- Returns
Sampled value and its log probability
- Return type
Tuple[nn.Variable, nn.Variable]
- sample_multiple(num_samples, noise_clip=None)[source]¶
Sample multiple values from the distribution. A new axis will be added between the first and second axis. Therefore, for mean and variance with shape (batch_size, data_shape), the returned value will have shape (batch_size, num_samples, data_shape).
If noise_clip is specified, sampled values will be clipped in the given range. Applicability of noise_clip depends on underlying implementation.
- Parameters
num_samples (int) – number of samples per batch
noise_clip (Tuple[float, float], optional) – float tuple of size 2 which contains the min and max value of the noise.
- Returns
Sampled value.
- Return type
nn.Variable
- class nnabla_rl.distributions.Softmax(z)[source]¶
Bases:
nnabla_rl.distributions.distribution.Distribution
Softmax distribution which samples a class index \(i\) according to the following probability.
\(i \sim \frac{\exp{z_{i}}}{\sum_{j}\exp{z_{j}}}\).
- Parameters
z (nn.Variable) – logits \(z\). Logits’ dimension should be same as the number of class to sample.
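The sampling rule can be made concrete with a short sketch for a single vector of logits. It subtracts the maximum logit before exponentiating, which is a common numerical-stability choice and an assumption here rather than a statement about the library; both function names are hypothetical.

```python
import math
import random


def softmax_probs(z):
    """Convert logits to class probabilities."""
    m = max(z)  # subtracting the max leaves the result unchanged but avoids overflow
    exps = [math.exp(zi - m) for zi in z]
    s = sum(exps)
    return [e / s for e in exps]


def sample_index(z, rng=random):
    """Sample a class index i with probability softmax(z)[i]."""
    return rng.choices(range(len(z)), weights=softmax_probs(z), k=1)[0]
```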
- choose_probable()[source]¶
Compute the most probable action of the distribution
- Returns
Probable action of the distribution
- Return type
nnabla.Variable
- entropy()[source]¶
Compute the entropy of the distribution
- Returns
Entropy of the distribution
- Return type
nn.Variable
- kl_divergence(q)[source]¶
Compute the Kullback-Leibler divergence between this distribution and the given distribution. This function will compute KL(self||q).
- Parameters
q (nnabla_rl.distributions.Distribution) – target distribution to compute the kl_divergence
- Returns
Kullback leibler divergence
- Return type
nn.Variable
- Raises
ValueError – target distribution’s type does not match with current distribution type.
- log_prob(x)[source]¶
Compute the log probability of given input
- Parameters
x (nn.Variable) – Target value to compute the log probability
- Returns
Log probability of given input
- Return type
nn.Variable
- mean()[source]¶
Compute the mean of the distribution (if it exists)
- Returns
mean of the distribution
- Return type
nn.Variable
- Raises
NotImplementedError – The distribution does not have mean
- property ndim¶
The number of dimensions of the distribution
- sample(noise_clip=None)[source]¶
Sample a value from the distribution. If noise_clip is specified, the sampled value will be clipped in the given range. Applicability of noise_clip depends on underlying implementation.
- Parameters
noise_clip (Tuple[float, float], optional) – float tuple of size 2 which contains the min and max value of the noise.
- Returns
Sampled value
- Return type
nn.Variable
- sample_and_compute_log_prob(noise_clip=None)[source]¶
Sample a value from the distribution and compute its log probability.
- Parameters
noise_clip (Tuple[float, float], optional) – float tuple of size 2 which contains the min and max value of the noise.
- Returns
Sampled value and its log probability
- Return type
Tuple[nn.Variable, nn.Variable]
- sample_multiple(num_samples, noise_clip=None)[source]¶
Sample multiple values from the distribution. A new axis will be added between the first and second axis. Therefore, for mean and variance with shape (batch_size, data_shape), the returned value will have shape (batch_size, num_samples, data_shape).
If noise_clip is specified, sampled values will be clipped in the given range. Applicability of noise_clip depends on underlying implementation.
- Parameters
num_samples (int) – number of samples per batch
noise_clip (Tuple[float, float], optional) – float tuple of size 2 which contains the min and max value of the noise.
- Returns
Sampled value.
- Return type
nn.Variable
- class nnabla_rl.distributions.SquashedGaussian(mean, ln_var)[source]¶
Bases:
nnabla_rl.distributions.distribution.Distribution
Gaussian distribution whose output is squashed with tanh.
\(z \sim \mathcal{N}(\mu,\,\sigma^{2})\). \(out = \tanh{z}\).
- Parameters
mean (nn.Variable) – mean \(\mu\) of underlying gaussian distribution.
ln_var (nn.Variable) – logarithm of the variance \(\sigma^{2}\). (i.e. ln_var is \(\log{\sigma^{2}}\))
Note
The log probability and kl_divergence of this distribution are different from those of the Gaussian distribution because the output is squashed.
- choose_probable()[source]¶
Compute the most probable action of the distribution
- Returns
Probable action of the distribution
- Return type
nnabla.Variable
- log_prob(x)[source]¶
Compute the log probability of given input
- Parameters
x (nn.Variable) – Target value to compute the log probability
- Returns
Log probability of given input
- Return type
nn.Variable
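Why the log probability differs from the plain gaussian case follows from the change of variables out = tanh(z): log p(out) = log N(z) - sum(log(1 - tanh(z)^2)). The sketch below evaluates this on plain lists, assuming the input is the already squashed value and lies strictly inside (-1, 1). The function name and the small constant eps are hypothetical, and the library's implementation may handle the input and the numerics differently.

```python
import math


def squashed_gaussian_log_prob(out, mean, ln_var, eps=1e-6):
    """Log density of out = tanh(z) with z drawn from a diagonal gaussian."""
    total = 0.0
    for oi, mi, lv in zip(out, mean, ln_var):
        z = math.atanh(oi)  # recover the pre-squash value; requires -1 < oi < 1
        gauss = -0.5 * (math.log(2.0 * math.pi) + lv + (z - mi) ** 2 / math.exp(lv))
        # Jacobian correction of tanh: d(out)/dz = 1 - out^2 (eps avoids log(0))
        total += gauss - math.log(1.0 - oi * oi + eps)
    return total
```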
- property ndim¶
The number of dimensions of the distribution
- sample(noise_clip=None)[source]¶
Sample a value from the distribution. If noise_clip is specified, the sampled value will be clipped in the given range. Applicability of noise_clip depends on underlying implementation.
- Parameters
noise_clip (Tuple[float, float], optional) – float tuple of size 2 which contains the min and max value of the noise.
- Returns
Sampled value
- Return type
nn.Variable
- sample_and_compute_log_prob(noise_clip=None)[source]¶
NOTE: To avoid sampling different random values for sample and log_prob, forward both variables together with nnabla.forward_all([sample, log_prob]). If you forward the two variables independently, you’ll get a log_prob for a different sample, since different random values are sampled internally.
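Why the two variables must be forwarded together can be illustrated with a toy model. The class below is a hypothetical stand-in written with the standard library only; it is not nnabla and only mimics the behavior described in the note, namely that every forward pass draws fresh noise.

```python
import math
import random


class ToyGaussianGraph:
    """Toy stand-in for a lazily evaluated sampling graph (not nnabla)."""

    def __init__(self, mean, ln_var, seed=0):
        self.mean = mean
        self.ln_var = ln_var
        self._rng = random.Random(seed)

    def _log_prob(self, z):
        var = math.exp(self.ln_var)
        return -0.5 * (math.log(2.0 * math.pi) + self.ln_var + (z - self.mean) ** 2 / var)

    def _draw(self):
        # Every forward pass draws fresh noise.
        return self.mean + math.exp(0.5 * self.ln_var) * self._rng.gauss(0.0, 1.0)

    def forward_sample(self):
        return self._draw()

    def forward_log_prob(self):
        # Forwarded on its own, log_prob is computed for a *new* draw.
        return self._log_prob(self._draw())

    def forward_all(self):
        # One shared forward pass: a single draw feeds both outputs.
        z = self._draw()
        return z, self._log_prob(z)


graph = ToyGaussianGraph(mean=0.0, ln_var=0.0)
z, log_p = graph.forward_all()              # consistent pair
separate_z = graph.forward_sample()         # one draw
separate_log_p = graph.forward_log_prob()   # log_prob of another draw
```

Only the pair returned by `forward_all` is guaranteed to belong together; `separate_log_p` describes a draw that the caller never sees.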
- sample_multiple(num_samples, noise_clip=None)[source]¶
Sample multiple values from the distribution. A new axis will be added between the first and second axes. Therefore, for mean and variance with shape (batch_size, data_shape), the returned value will have shape (batch_size, num_samples, data_shape).
If noise_clip is specified, sampled values will be clipped in the given range. Applicability of noise_clip depends on underlying implementation.
- Parameters
num_samples (int) – number of samples per batch
noise_clip (Tuple[float, float], optional) – float tuple of size 2 which contains the min and max value of the noise.
- Returns
Sampled value.
- Return type
nn.Variable
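The shape rule described for sample_multiple can be written down as a small helper. This is only a shape calculation under the stated rule; `sample_multiple_shape` is a hypothetical name and not part of the library.

```python
from typing import Tuple


def sample_multiple_shape(param_shape: Tuple[int, ...], num_samples: int) -> Tuple[int, ...]:
    """Shape of the sampled value for mean/variance of shape (batch_size, *data_shape)."""
    batch_size, *data_shape = param_shape
    # The new sample axis is inserted between the first and second axes.
    return (batch_size, num_samples, *data_shape)
```

For example, parameters of shape (32, 6) sampled 10 times per batch element give a value of shape (32, 10, 6).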
- sample_multiple_and_compute_log_prob(num_samples, noise_clip=None)[source]¶
NOTE: To avoid sampling different random values for sample and log_prob, forward both variables together with nnabla.forward_all([sample, log_prob]). If you forward the two variables independently, you’ll get a log_prob for a different sample, since different random values are sampled internally.
Environments¶
EnvironmentInfo¶
- class nnabla_rl.environments.environment_info.EnvironmentInfo(observation_space: gym.spaces.space.Space, action_space: gym.spaces.space.Space, max_episode_steps: int)[source]¶
Environment Information class
This class contains the basic information of the target training environment.
- property action_dim¶
The dimension of the action, assuming that the action is flattened.
- property action_shape¶
The shape of action space
- static from_env(env)[source]¶
Create env_info from environment
- Parameters
env (gym.Env) – the environment
- Returns
EnvironmentInfo (EnvironmentInfo)
Example
>>> import gym
>>> from nnabla_rl.environments.environment_info import EnvironmentInfo
>>> env = gym.make("CartPole-v0")
>>> env_info = EnvironmentInfo.from_env(env)
>>> env_info.state_shape
(4,)
- is_continuous_action_env()[source]¶
Check whether the action to execute in the environment is continuous or not
- Returns
True if the action to execute in the environment is continuous. Otherwise False.
- Return type
bool
- is_discrete_action_env()[source]¶
Check whether the action to execute in the environment is discrete or not
- Returns
True if the action to execute in the environment is discrete. Otherwise False.
- Return type
bool
- property state_dim¶
The dimension of the state, assuming that the state is flattened.
- property state_shape¶
The shape of observation space
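The relation between a shape and its flattened dimension can be sketched as follows. This assumes a box-shaped (continuous) space where the dimension is the product of the shape entries; discrete spaces may be handled differently by the library. `flattened_dim` is a hypothetical helper.

```python
import math
from typing import Tuple


def flattened_dim(shape: Tuple[int, ...]) -> int:
    """Number of elements of an array with the given shape once flattened."""
    return math.prod(shape)
```

With the CartPole state shape (4,) this gives 4, and a stack of four 84x84 frames with shape (4, 84, 84) gives 28224.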
Hooks¶
A hook is a utility for training.
All hooks are derived from nnabla_rl.hook.Hook.
Hook¶
- class nnabla_rl.hook.Hook(timing=1000)[source]¶
Base class of hooks for Algorithm classes.
A hook is called every ‘timing’ iterations during training. ‘timing’ is specified when the hook is instantiated.
- abstract on_hook_called(algorithm)[source]¶
Called every “timing” iterations, as set when the hook instance is created. Runs an additional periodic operation (see each class’ documentation) during training.
- Parameters
algorithm (nnabla_rl.algorithm.Algorithm) – Algorithm instance to perform additional operation.
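The timing mechanism described above can be sketched with a toy training loop. `ToyAlgorithm`, `ToyHook` and `run_training` are hypothetical stand-ins written for illustration; they are not the library's classes and only assume that a hook fires on iterations that are multiples of its timing.

```python
class ToyAlgorithm:
    """Hypothetical stand-in for an algorithm; it only tracks the iteration count."""

    def __init__(self):
        self.iteration_num = 0


class ToyHook:
    """Toy hook that records the iterations at which it was called."""

    def __init__(self, timing=1000):
        self._timing = timing
        self.called_at = []

    def __call__(self, algorithm):
        # Fire only on iterations that are multiples of `timing`.
        if algorithm.iteration_num % self._timing == 0:
            self.on_hook_called(algorithm)

    def on_hook_called(self, algorithm):
        self.called_at.append(algorithm.iteration_num)


def run_training(algorithm, hooks, total_iterations):
    """Toy training loop: advance the iteration counter and offer it to each hook."""
    for _ in range(total_iterations):
        algorithm.iteration_num += 1
        for hook in hooks:
            hook(algorithm)


hook = ToyHook(timing=3)
run_training(ToyAlgorithm(), [hook], total_iterations=10)
```

After ten iterations with a timing of 3, the toy hook has been called at iterations 3, 6 and 9.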
List of Hooks¶
- class nnabla_rl.hooks.EvaluationHook(env, evaluator=<nnabla_rl.utils.evaluator.EpisodicEvaluator object>, timing=1000, writer=None)[source]¶
Bases:
nnabla_rl.hook.Hook
Hook to run evaluation during training.
- Parameters
env (gym.Env) – Environment to run the evaluation
evaluator (Callable[[nnabla_rl.algorithm.Algorithm, gym.Env], List[float]]) – Evaluator which runs the actual evaluation. Defaults to EpisodicEvaluator.
timing (int) – Evaluation interval. Defaults to 1000 iterations.
writer (nnabla_rl.writer.Writer, optional) – Writer instance to save/print the evaluation results. Defaults to None.
- on_hook_called(algorithm)[source]¶
Called every “timing” iterations, as set when the hook instance is created. Runs an additional periodic operation (see each class’ documentation) during training.
- Parameters
algorithm (nnabla_rl.algorithm.Algorithm) – Algorithm instance to perform additional operation.
- class nnabla_rl.hooks.IterationNumHook(timing=1)[source]¶
Bases:
nnabla_rl.hook.Hook
Hook to print the iteration number periodically. This hook just prints the iteration number of training.
- Parameters
timing (int) – Printing interval. Defaults to 1 iteration.
- on_hook_called(algorithm)[source]¶
Called every “timing” iterations, as set when the hook instance is created. Runs an additional periodic operation (see each class’ documentation) during training.
- Parameters
algorithm (nnabla_rl.algorithm.Algorithm) – Algorithm instance to perform additional operation.
- class nnabla_rl.hooks.IterationStateHook(writer=None, timing=1000)[source]¶
Bases:
nnabla_rl.hook.Hook
Hook which retrieves the iteration state to print/save the training status through writer.
- Parameters
timing (int) – Retrieval interval. Defaults to 1000 iterations.
writer (nnabla_rl.writer.Writer, optional) – Writer instance to save/print the iteration states. Defaults to None.
- on_hook_called(algorithm)[source]¶
Called every “timing” iterations, as set when the hook instance is created. Runs an additional periodic operation (see each class’ documentation) during training.
- Parameters
algorithm (nnabla_rl.algorithm.Algorithm) – Algorithm instance to perform additional operation.
- class nnabla_rl.hooks.SaveSnapshotHook(outdir, timing=1000)[source]¶
Bases:
nnabla_rl.hook.Hook
Hook to save the training snapshot of current algorithm.
- Parameters
timing (int) – Saving interval. Defaults to 1000 iteration.
- on_hook_called(algorithm)[source]¶
Called every “timing” iterations, as set when the hook instance is created. Runs an additional periodic operation (see each class’ documentation) during training.
- Parameters
algorithm (nnabla_rl.algorithm.Algorithm) – Algorithm instance to perform additional operation.
- class nnabla_rl.hooks.TimeMeasuringHook(timing=1)[source]¶
Bases:
nnabla_rl.hook.Hook
Hook to measure and print the actual time spent to run the iteration(s).
- Parameters
timing (int) – Measuring interval. Defaults to 1 iteration.
- on_hook_called(algorithm)[source]¶
Called every “timing” iterations, as set when the hook instance is created. Runs an additional periodic operation (see each class’ documentation) during training.
- Parameters
algorithm (nnabla_rl.algorithm.Algorithm) – Algorithm instance to perform additional operation.
Models¶
All models are derived from nnabla_rl.models.Model
Model¶
- class nnabla_rl.models.model.Model(scope_name: str)[source]¶
Model Class
- Parameters
scope_name (str) – the scope name of model
- deepcopy(new_scope_name: str) → nnabla_rl.models.model.Model[source]¶
Create a copy of the model. All parameters associated with the model (if any) will be copied.
- Parameters
new_scope_name (str) – scope_name of parameters for newly created model
- Returns
copied model
- Return type
Model
- Raises
ValueError – Given scope name is same as the model or already exists.
- get_parameters(grad_only: bool = True) → Dict[str, nnabla._variable.Variable][source]¶
Retrieve parameters associated with this model
- Parameters
grad_only (bool) – Retrieve only parameters with need_grad = True. Defaults to True.
- Returns
Parameter map.
- Return type
parameters (OrderedDict)
- load_parameters(filepath: Union[str, pathlib.Path]) → None[source]¶
Load model parameters from given filepath.
- Parameters
filepath (str or pathlib.Path) – parameter file path
- save_parameters(filepath: Union[str, pathlib.Path]) → None[source]¶
Save model parameters to given filepath.
- Parameters
filepath (str or pathlib.Path) – parameter file path
- property scope_name: str¶
Get scope name of this model.
- Returns
scope name of the model
- Return type
scope_name (str)
List of Models¶
- class nnabla_rl.models.Perturbator(scope_name)[source]¶
Bases:
nnabla_rl.models.model.Model
Abstract class for perturbator
A perturbator generates noise to add to the action for the current state.
- class nnabla_rl.models.Policy(scope_name: str)[source]¶
Bases:
nnabla_rl.models.model.Model
- class nnabla_rl.models.DeterministicPolicy(scope_name: str)[source]¶
Bases:
nnabla_rl.models.policy.Policy
Abstract class for deterministic policy
This policy returns an action for the given state.
- class nnabla_rl.models.StochasticPolicy(scope_name: str)[source]¶
Bases:
nnabla_rl.models.policy.Policy
Abstract class for stochastic policy
This policy returns a probability distribution of action for the given state.
- abstract pi(s: nnabla._variable.Variable) → nnabla_rl.distributions.distribution.Distribution[source]¶
- Parameters
s (nnabla.Variable) – State variable
- Returns
Probability distribution of the action for the given state
- Return type
Distribution
- class nnabla_rl.models.QFunction(scope_name: str)[source]¶
Bases:
nnabla_rl.models.model.Model
Base QFunction Class
- all_q(s: nnabla._variable.Variable) → nnabla._variable.Variable[source]¶
Compute Q-values for each action for given state
- Parameters
s (nn.Variable) – state variable
- Returns
Q-values for each action for given state
- Return type
nn.Variable
- argmax_q(s: nnabla._variable.Variable) → nnabla._variable.Variable[source]¶
Compute the action which maximizes the Q-value for given state
- Parameters
s (nn.Variable) – state variable
- Returns
action which maximizes the Q-value for given state
- Return type
nn.Variable
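The relation between all_q and argmax_q can be sketched on plain Python lists. This assumes a discrete action space where each row holds one Q-value per action; the helper below is hypothetical and works on nested lists rather than nn.Variable.

```python
from typing import List, Sequence


def argmax_q(all_q: Sequence[Sequence[float]]) -> List[int]:
    """For each state (row), return the index of the action with the largest Q-value."""
    return [max(range(len(q)), key=q.__getitem__) for q in all_q]
```

Ties are resolved in favor of the lowest action index in this sketch; the library's behavior on ties is not specified here.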
- class nnabla_rl.models.ValueDistributionFunction(scope_name: str, n_action: int, n_atom: int, v_min: float, v_max: float)[source]¶
Bases:
nnabla_rl.models.model.Model
Base value distribution class.
Computes the probabilities of q-value for each action. Value distribution function models the probabilities of q value for each action by dividing the values between the maximum q value and minimum q value into ‘n_atom’ number of bins and assigning the probability to each bin.
- Parameters
scope_name (str) – scope name of the model
n_action (int) – Number of actions used in the target environment.
n_atom (int) – Number of bins.
v_min (float) – Minimum value of the distribution.
v_max (float) – Maximum value of the distribution.
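How n_atom, v_min and v_max define the value distribution can be sketched as follows. The sketch assumes the convention of categorical distributional RL (C51), where the n_atom support points are evenly spaced from v_min to v_max inclusive and the Q-value is the expectation over them. Both helper names are hypothetical, and the library's own spacing may differ.

```python
from typing import List, Sequence


def support_atoms(v_min: float, v_max: float, n_atom: int) -> List[float]:
    """Evenly spaced support points z_0 = v_min, ..., z_{n_atom - 1} = v_max (n_atom >= 2)."""
    delta_z = (v_max - v_min) / (n_atom - 1)
    return [v_min + i * delta_z for i in range(n_atom)]


def expected_q(probs: Sequence[float], atoms: Sequence[float]) -> float:
    """Q-value as the expectation of the value distribution: sum_i p_i * z_i."""
    return sum(p * z for p, z in zip(probs, atoms))


atoms = support_atoms(-10.0, 10.0, 5)
```

A distribution that puts all of its mass on the last atom yields a Q-value of v_max, and a distribution split evenly between the first and last atoms yields the midpoint.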
- all_probs(s: nnabla._variable.Variable) → nnabla._variable.Variable[source]¶
Compute probabilities of atoms for all possible actions for given state
- Parameters
s (nn.Variable) – state variable
- Returns
probabilities of atoms for all possible actions for given state
- Return type
nn.Variable
- as_q_function() → nnabla_rl.models.q_function.QFunction[source]¶
Convert the value distribution function to QFunction.
- Returns
QFunction instance which computes the q-values based on the probabilities.
- Return type
QFunction
- max_q_probs(s: nnabla._variable.Variable) → nnabla._variable.Variable[source]¶
Compute probabilities of atoms for given state that maximizes the q_value
- Parameters
s (nn.Variable) – state variable
- Returns
probabilities of atoms for given state that maximizes the q_value
- Return type
nn.Variable
- abstract probs(s: nnabla._variable.Variable, a: nnabla._variable.Variable) → nnabla._variable.Variable[source]¶
Compute probabilities of atoms for given state and action
- Parameters
s (nn.Variable) – state variable
a (nn.Variable) – action variable
- Returns
probabilities of atoms for given state and action
- Return type
nn.Variable
- class nnabla_rl.models.QuantileDistributionFunction(scope_name: str, n_action: int, n_quantile: int)[source]¶
Bases:
nnabla_rl.models.model.Model
Base quantile distribution class.
Computes the quantiles of q-value for each action. Quantile distribution function models the quantiles of q value for each action by dividing the probability (which is between 0.0 and 1.0) into ‘n_quantile’ number of bins and assigning the n-quantile to n-th bin.
- Parameters
scope_name (str) – scope name of the model
n_action (int) – Number of actions used in the target environment.
n_quantile (int) – Number of bins.
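The division of the probability range into n_quantile bins can be sketched as below. The sketch assumes the quantile-regression DQN convention, where each bin is represented by its midpoint and the Q-value is the mean of the quantile values. Both helper names are hypothetical and the library may organize this differently.

```python
from typing import List, Sequence


def quantile_midpoints(n_quantile: int) -> List[float]:
    """Midpoints of n_quantile equal-width bins over the probability range [0, 1]."""
    return [(2 * i + 1) / (2 * n_quantile) for i in range(n_quantile)]


def q_from_quantiles(quantiles: Sequence[float]) -> float:
    """Q-value as the mean of equally weighted quantile values."""
    return sum(quantiles) / len(quantiles)
```

With four bins the midpoints are 0.125, 0.375, 0.625 and 0.875, and each quantile value carries a weight of 1/4 in the Q-value.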
- all_quantiles(s: nnabla._variable.Variable) → nnabla._variable.Variable[source]¶
Computes the quantiles of q-value for each action for the given state.
- Parameters
s (nn.Variable) – state variable
- Returns
quantiles of q-value for each action for the given state
- Return type
nn.Variable
- as_q_function() → nnabla_rl.models.q_function.QFunction[source]¶
Convert the quantile distribution function to QFunction.
- Returns
QFunction instance which computes the q-values based on the quantiles.
- Return type
QFunction
- max_q_quantiles(s: nnabla._variable.Variable) → nnabla._variable.Variable[source]¶
Compute the quantiles of q-value for given state that maximizes the q_value
- Parameters
s (nn.Variable) – state variable
- Returns
quantiles of q-value for given state that maximizes the q_value
- Return type
nn.Variable
- quantiles(s: nnabla._variable.Variable, a: nnabla._variable.Variable) → nnabla._variable.Variable[source]¶
Computes the quantiles of q-value for given state and action.
- Parameters
s (nn.Variable) – state variable
a (nn.Variable) – action variable
- Returns
quantiles of q-value for given state and action.
- Return type
nn.Variable
- class nnabla_rl.models.StateActionQuantileFunction(scope_name: str, n_action: int, K: int, risk_measure_function: Callable[[nnabla._variable.Variable], nnabla._variable.Variable] = <function risk_neutral_measure>)[source]¶
Bases:
nnabla_rl.models.model.Model
state-action quantile function class.
Computes the return samples of q-value for each action. State-action quantile function computes the return samples of q value for each action using sampled quantile threshold (e.g. \(\tau\sim U([0,1])\)) for given state.
- Parameters
scope_name (str) – scope name of the model
n_action (int) – Number of actions used in the target environment.
K (int) – Number of samples for quantile threshold \(\tau\).
risk_measure_function (Callable[[nn.Variable], nn.Variable]) – Risk measure function which modifies the weightings of tau. Defaults to the risk-neutral measure, which leaves the taus unchanged.
- all_quantile_values(s: nnabla._variable.Variable, tau: nnabla._variable.Variable) → nnabla._variable.Variable[source]¶
Compute the return samples for all action for given state and quantile threshold.
- Parameters
s (nn.Variable) – state variable.
tau (nn.Variable) – quantile threshold.
- Returns
return samples from implicit return distribution for given state using tau.
- Return type
nn.Variable
- as_q_function() → nnabla_rl.models.q_function.QFunction[source]¶
Convert the state action quantile function to QFunction.
- Returns
QFunction instance which computes the q-values based on return samples.
- Return type
QFunction
- max_q_quantile_values(s: nnabla._variable.Variable, tau: nnabla._variable.Variable) → nnabla._variable.Variable[source]¶
Compute the return samples from distribution that maximizes q value for given state using quantile threshold.
- Parameters
s (nn.Variable) – state variable.
tau (nn.Variable) – quantile threshold.
- Returns
return samples from implicit return distribution that maximizes q for given state using tau.
- Return type
nn.Variable
- quantile_values(s: nnabla._variable.Variable, a: nnabla._variable.Variable, tau: nnabla._variable.Variable) → nnabla._variable.Variable[source]¶
Compute the return samples for given state and action.
- Parameters
s (nn.Variable) – state variable.
a (nn.Variable) – action variable.
tau (nn.Variable) – quantile threshold.
- Returns
return samples from implicit return distribution for given state and action using tau.
- Return type
nn.Variable
- sample_tau(shape: Optional[Iterable] = None) → nnabla._variable.Variable[source]¶
Sample quantile thresholds from uniform distribution
- Parameters
shape (Tuple[int] or None) – shape of the quantile threshold to sample. If None the shape will be (1, K).
- Returns
quantile thresholds
- Return type
nn.Variable
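Sampling the quantile thresholds can be sketched with the standard library. The functions below are toy versions operating on nested lists, not the library's nn.Variable-based methods; the `rng` argument is an addition made here for reproducibility and the sketch only handles two-dimensional shapes.

```python
import random
from typing import List, Optional, Tuple


def sample_tau(K: int, shape: Optional[Tuple[int, int]] = None,
               rng: Optional[random.Random] = None) -> List[List[float]]:
    """Sample quantile thresholds tau ~ U([0, 1)); the shape defaults to (1, K)."""
    rng = rng or random.Random()
    rows, cols = shape if shape is not None else (1, K)
    return [[rng.random() for _ in range(cols)] for _ in range(rows)]


def risk_neutral_measure(tau: List[List[float]]) -> List[List[float]]:
    """Risk-neutral measure: leaves the sampled thresholds unchanged."""
    return tau


tau = risk_neutral_measure(sample_tau(K=8, rng=random.Random(0)))
```

A risk-sensitive measure would replace `risk_neutral_measure` with a function that distorts the thresholds, for example by shrinking them toward 0 to emphasize low returns.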
- class nnabla_rl.models.reward_function.RewardFunction(scope_name: str)[source]¶
Bases:
nnabla_rl.models.model.Model
Base reward function class
- abstract r(s_current: nnabla._variable.Variable, a_current: nnabla._variable.Variable, s_next: nnabla._variable.Variable) → nnabla._variable.Variable[source]¶
Computes the reward for the given state, action and next state. One (or more than one) of the input variables may not be used in the actual computation.
- Parameters
s_current (nnabla.Variable) – State variable
a_current (nnabla.Variable) – Action variable
s_next (nnabla.Variable) – Next state variable
- Returns
Reward for the given state, action and next state.
- Return type
nnabla.Variable
- class nnabla_rl.models.VFunction(scope_name: str)[source]¶
Bases:
nnabla_rl.models.model.Model
Base Value function class
- class nnabla_rl.models.Encoder(scope_name: str)[source]¶
Bases:
nnabla_rl.models.model.Model
- class nnabla_rl.models.VariationalAutoEncoder(scope_name: str)[source]¶
Bases:
nnabla_rl.models.encoder.Encoder
- abstract decode(z: Optional[nnabla._variable.Variable], **kwargs) → nnabla._variable.Variable[source]¶
Reconstruct the latent representation.
- Parameters
z (nn.Variable, optional) – latent variable. If the input is None, random sample will be used instead.
- Returns
reconstructed variable
- Return type
nn.Variable
- abstract decode_multiple(z: Optional[nnabla._variable.Variable], decode_num: int, **kwargs)[source]¶
Reconstruct multiple latent representations.
- Parameters
z (nn.Variable, optional) – encoder input. If the input is None, random sample will be used instead.
- Returns
Reconstructed input and latent distribution
- Return type
nn.Variable
- abstract encode_and_decode(x: nnabla._variable.Variable, **kwargs) → Tuple[nnabla_rl.distributions.distribution.Distribution, nnabla._variable.Variable][source]¶
Encode the input variable and reconstruct.
- Parameters
x (nn.Variable) – encoder input.
- Returns
latent distribution and reconstructed input
- Return type
Tuple[Distribution, nn.Variable]
- abstract latent_distribution(x: nnabla._variable.Variable, **kwargs) → nnabla_rl.distributions.distribution.Distribution[source]¶
Compute the latent distribution \(p(z|x)\).
- Parameters
x (nn.Variable) – encoder input.
- Returns
latent distribution
- Return type
Distribution
ReplayBuffers¶
All replay buffers are derived from nnabla_rl.replay_buffer.ReplayBuffer
ReplayBuffer¶
- class nnabla_rl.replay_buffer.ReplayBuffer(capacity: Optional[int] = None)[source]¶
- append(experience: Tuple[Type[numpy.array], Type[numpy.array], float, float, Type[numpy.array], Dict[str, Any]])[source]¶
Add new experience to the replay buffer.
- Parameters
experience (array-like) – Experience, i.e. a transition consisting of items such as the state, the action, the reward and a flag indicating whether the episode has ended. See the [Replay buffer documents](replay_buffer.md) for more information.
Notes
If the replay buffer is full, the oldest experience (head of the buffer) will be dropped and the given experience will be added to the tail of the buffer.
- append_all(experiences: Sequence[Tuple[Type[numpy.array], Type[numpy.array], float, float, Type[numpy.array], Dict[str, Any]]])[source]¶
Add list of experiences to the replay buffer.
- Parameters
experiences (Sequence[Experience]) – Sequence of experiences to insert to the buffer
Notes
If the replay buffer is full, the oldest experiences (head of the buffer) will be dropped and the given experiences will be added to the tail of the buffer.
- property capacity: Optional[int]¶
Capacity (max length) of this replay buffer, or None if no capacity is set.
- sample(num_samples: int = 1) → Tuple[Sequence[Tuple[Type[numpy.array], Type[numpy.array], float, float, Type[numpy.array], Dict[str, Any]]], Dict[str, Any]][source]¶
Randomly sample num_samples experiences from the replay buffer.
- Parameters
num_samples (int) – Number of samples to sample from the replay buffer.
- Returns
experiences (Sequence[Experience]): Random num_samples of experiences. info (Dict[str, Any]): dictionary of information about experiences.
- Return type
Tuple[Sequence[Experience], Dict[str, Any]]
Notes
Sampling strategy depends on the underlying implementation.
- sample_indices(indices: Sequence[int]) → Tuple[Sequence[Tuple[Type[numpy.array], Type[numpy.array], float, float, Type[numpy.array], Dict[str, Any]]], Dict[str, Any]][source]¶
Sample experiences for given indices from the replay buffer.
- Parameters
indices (array-like) – list of array indices of the data to sample
- Returns
Sample of experiences for given indices.
- Return type
experiences (array-like)
- Raises
ValueError – If indices are empty
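The documented buffer behavior (drop the oldest experience when full, raise ValueError on empty indices) can be sketched with collections.deque. `ToyReplayBuffer` is a hypothetical illustration and not the library's implementation; uniform sampling without replacement and the empty info dictionary are assumptions made for the sketch.

```python
import random
from collections import deque


class ToyReplayBuffer:
    """Toy FIFO replay buffer; capacity=None means the buffer is unbounded."""

    def __init__(self, capacity=None):
        # A deque with maxlen drops items from the head when it is full.
        self._buffer = deque(maxlen=capacity)

    def __len__(self):
        return len(self._buffer)

    def append(self, experience):
        self._buffer.append(experience)

    def append_all(self, experiences):
        for experience in experiences:
            self.append(experience)

    def sample(self, num_samples=1):
        # Assumption: uniform sampling without replacement.
        indices = random.sample(range(len(self._buffer)), num_samples)
        return self.sample_indices(indices)

    def sample_indices(self, indices):
        if len(indices) == 0:
            raise ValueError("indices must not be empty")
        experiences = [self._buffer[index] for index in indices]
        return experiences, {}  # the toy info dictionary is always empty


buffer = ToyReplayBuffer(capacity=3)
buffer.append_all(range(5))  # 0 and 1 are dropped once the buffer is full
```

After appending five items to a buffer of capacity 3, only the three most recent items remain, in insertion order.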
List of ReplayBuffer¶
- class nnabla_rl.replay_buffers.DecorableReplayBuffer(capacity, decor_fun)[source]¶
Bases:
nnabla_rl.replay_buffer.ReplayBuffer
Buffer which can decorate the experience with external decoration function
This buffer enables decorating the experience before the item is used for building the batch. Decoration function will be called when __getitem__ is called. You can use this buffer to augment the data or add noise to the experience.
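The idea of decorating on access can be sketched as follows. `ToyDecorableBuffer` and `clip_reward` are hypothetical names used for illustration, and experiences are simplified to (state, action, reward) tuples; the sketch only shows that the stored item stays untouched while the item returned by `__getitem__` is decorated.

```python
from collections import deque


class ToyDecorableBuffer:
    """Toy buffer that applies decor_fun whenever an item is read."""

    def __init__(self, capacity, decor_fun):
        self._buffer = deque(maxlen=capacity)
        self._decor_fun = decor_fun

    def append(self, experience):
        # Experiences are stored undecorated.
        self._buffer.append(experience)

    def __getitem__(self, index):
        # Decoration happens on access, not on insertion.
        return self._decor_fun(self._buffer[index])


def clip_reward(experience):
    """Example decoration: clip the reward of a (state, action, reward) tuple to [-1, 1]."""
    state, action, reward = experience
    return state, action, max(-1.0, min(1.0, reward))


buffer = ToyDecorableBuffer(capacity=10, decor_fun=clip_reward)
buffer.append((0, 1, 5.0))
decorated = buffer[0]
```

Because the decoration runs on every read, a random decoration (for example additive noise) would produce a different result each time the same experience is accessed.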
- class nnabla_rl.replay_buffers.MemoryEfficientAtariBuffer(capacity)[source]¶
Bases:
nnabla_rl.replay_buffer.ReplayBuffer
Buffer designed to compactly save experiences of Atari environments used in DQN.
DQN (and other training algorithms) requires a large replay buffer when training on Atari games. If you naively save the experiences, you’ll need more than 100GB to save them (assuming 1M experiences), which usually does not fit in the machine’s memory (unless you have plenty of money). This replay buffer reduces the size of each experience by casting the images to uint8 and removing old frames concatenated to the observation. By using this buffer, you can hold 1M experiences using only approximately 20GB of memory.
Note that this class is designed only for DQN-style training on Atari environments (i.e. the state consists of 4 concatenated grayscale frames whose values are normalized between 0 and 1).
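The order of magnitude of the naive memory requirement can be checked with a short calculation. The sketch assumes 84x84 frames (the usual DQN preprocessing), float32 values, and that each experience stores both the state and the next state; none of these are stated in this section. The compact figure counts only one new uint8 frame per experience, so it is an idealized lower bound on pixel data and presumably omits data that the real buffer keeps.

```python
def naive_bytes(num_experiences, frames=4, height=84, width=84, bytes_per_value=4):
    """Bytes needed when every experience stores a full state and next state."""
    per_state = frames * height * width * bytes_per_value
    return num_experiences * 2 * per_state


def compact_pixel_bytes(num_experiences, height=84, width=84):
    """Idealized lower bound: one new uint8 frame per experience."""
    return num_experiences * height * width


naive_gb = naive_bytes(1_000_000) / 1e9
compact_gb = compact_pixel_bytes(1_000_000) / 1e9
```

Under these assumptions the naive layout needs roughly 226GB for 1M experiences, while casting to uint8 and dropping the repeated frames each cut the pixel data by a factor of four or more.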
- append(experience)[source]¶
Add new experience to the replay buffer.
- Parameters
experience (array-like) – Experience, i.e. a transition consisting of items such as the state, the action, the reward and a flag indicating whether the episode has ended. See the [Replay buffer documents](replay_buffer.md) for more information.
Notes
If the replay buffer is full, the oldest experience (head of the buffer) will be dropped and the given experience will be added to the tail of the buffer.
- append_all(experiences)[source]¶
Add list of experiences to the replay buffer.
- Parameters
experiences (Sequence[Experience]) – Sequence of experiences to insert to the buffer
Notes
If the replay buffer is full, the oldest experiences (head of the buffer) will be dropped and the given experiences will be added to the tail of the buffer.
- class nnabla_rl.replay_buffers.PrioritizedReplayBuffer(capacity, alpha=0.6, beta=0.4, betasteps=10000, epsilon=1e-08)[source]¶
Bases:
nnabla_rl.replay_buffer.ReplayBuffer
- append(experience)[source]¶
Add new experience to the replay buffer.
- Parameters
experience (array-like) – Experience, i.e. a transition consisting of items such as the state, the action, the reward and a flag indicating whether the episode has ended. See the [Replay buffer documents](replay_buffer.md) for more information.
Notes
If the replay buffer is full, the oldest experience (head of the buffer) will be dropped and the given experience will be added to the tail of the buffer.
- sample(num_samples=1)[source]¶
Randomly sample num_samples experiences from the replay buffer.
- Parameters
num_samples (int) – Number of samples to sample from the replay buffer.
- Returns
experiences (Sequence[Experience]): Random num_samples of experiences. info (Dict[str, Any]): dictionary of information about experiences.
- Return type
Tuple[Sequence[Experience], Dict[str, Any]]
Notes
Sampling strategy depends on the underlying implementation.
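For a prioritized buffer, the sampling strategy commonly follows the prioritized experience replay scheme. The sketch below shows that scheme under the assumption that alpha and beta play their usual roles (priority exponent and importance-sampling exponent); this section does not document how PrioritizedReplayBuffer uses them, so treat the helper functions as hypothetical illustrations rather than the library's code.

```python
from typing import List, Sequence


def sampling_probabilities(priorities: Sequence[float], alpha: float = 0.6) -> List[float]:
    """P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 gives uniform sampling."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probabilities: Sequence[float], beta: float = 0.4) -> List[float]:
    """w_i = (N * P(i))^(-beta), normalized so that the largest weight is 1."""
    n = len(probabilities)
    weights = [(n * p) ** (-beta) for p in probabilities]
    max_weight = max(weights)
    return [w / max_weight for w in weights]


probs = sampling_probabilities([1.0, 1.0, 2.0], alpha=1.0)
weights = importance_weights(probs, beta=1.0)
```

Experiences with larger priority are sampled more often and receive smaller importance weights, which compensates for the bias that non-uniform sampling introduces.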