Algorithms

All algorithms are derived from nnabla_rl.algorithm.Algorithm.

Note

Algorithms run on the CPU by default, regardless of any nnabla context set before instantiation. To run an algorithm on a GPU, set gpu_id through the algorithm’s config. Note that the algorithm overrides the nnabla context when training starts.

Algorithm

class nnabla_rl.algorithm.AlgorithmConfig(gpu_id: int = -1)

Configuration common to all algorithms

Parameters: gpu_id (int) – ID of the GPU to use. If negative, training runs on the CPU. Defaults to -1.
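The gpu_id convention above can be illustrated with a minimal sketch. SketchConfig and describe_device are hypothetical names introduced only for this illustration; they are not part of nnabla_rl, and the device labels are made up for readability.

```python
from dataclasses import dataclass


@dataclass
class SketchConfig:
    """Hypothetical stand-in mirroring the single field of AlgorithmConfig."""

    gpu_id: int = -1  # negative means "run on the CPU", as documented above


def describe_device(config: SketchConfig) -> str:
    """Return an illustrative device label for the given config."""
    if config.gpu_id < 0:
        return "cpu"
    return f"gpu:{config.gpu_id}"


default_device = describe_device(SketchConfig())
gpu_device = describe_device(SketchConfig(gpu_id=0))
print(default_device, gpu_device)
```

With the real library, the same idea would presumably be expressed by passing a config built with a non-negative gpu_id to the algorithm's constructor.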

class nnabla_rl.algorithm.Algorithm(env_info, config=AlgorithmConfig(gpu_id=-1))

Base Algorithm class

Parameters

env_or_env_info (gym.Env or EnvironmentInfo) – environment or environment info
config (AlgorithmConfig) – configuration of the algorithm

Note

By default, functions, solvers and configurations follow the settings in each algorithm’s original paper. The default functions may not work for every environment.

abstract compute_eval_action(state) → numpy.array

Compute the action for the given state using the current best policy. This is typically used for evaluation.

Parameters: state (np.ndarray) – state for which to compute the action.
Returns: Action for the given state under the currently trained policy.
Return type: np.ndarray
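A typical evaluation loop built around compute_eval_action might look like the sketch below. It assumes the classic gym convention in which reset() returns a state and step() returns a (state, reward, done, info) tuple. ConstantActionAlgorithm and CountingEnv are hypothetical stand-ins so that the example runs without nnabla_rl or gym installed, and they use plain lists where the real method works with np.ndarray.

```python
from typing import List, Tuple


class ConstantActionAlgorithm:
    """Hypothetical stand-in that exposes only compute_eval_action."""

    def compute_eval_action(self, state: List[float]) -> List[float]:
        # A trained algorithm would query its current best policy here.
        return [1.0]


class CountingEnv:
    """Hypothetical toy environment using the classic gym reset/step convention."""

    def __init__(self, goal: float = 3.0) -> None:
        self._goal = goal
        self._position = 0.0

    def reset(self) -> List[float]:
        self._position = 0.0
        return [self._position]

    def step(self, action: List[float]) -> Tuple[List[float], float, bool, dict]:
        self._position += action[0]
        done = self._position >= self._goal
        reward = 1.0 if done else 0.0
        return [self._position], reward, done, {}


def run_evaluation_episode(algorithm, env, max_steps: int = 100) -> float:
    """Roll out one episode with evaluation actions and return the total reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = algorithm.compute_eval_action(state)
        state, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


episode_return = run_evaluation_episode(ConstantActionAlgorithm(), CountingEnv())
print(episode_return)
```

Because the loop only relies on the compute_eval_action method, any object derived from Algorithm should be usable in place of the stand-in, provided the environment follows the same step convention.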

property iteration_num: int

Current iteration number.

Returns: Current iteration number of the running training.
Return type: int

property latest_iteration_state: Dict[str, Any]

Return the latest iteration state, which is composed of items describing the state of the training process. You can use this state for debugging (e.g. to plot a loss curve). See [IterationStateHook](./hooks/iteration_state_hook.py) for more details.

Returns: Dictionary with items of training process state.
Return type: Dict[str, Any]

property max_iterations: int

Maximum iteration number of the running training.

Returns: Maximum iteration number of the running training.
Return type: int

set_hooks(hooks: Sequence[nnabla_rl.hook.Hook])

Set hooks that run additional operations during training. Previously set hooks are removed and replaced with the new hooks.

Parameters: hooks (list of nnabla_rl.hook.Hook) – Hooks to invoke during training

train(env_or_buffer: Union[gym.core.Env, nnabla_rl.replay_buffer.ReplayBuffer], total_iterations: int)

Train the policy with the reinforcement learning algorithm.

Parameters

env_or_buffer (Union[gym.Env, ReplayBuffer]) – Target environment to train the policy online, or replay buffer to train the policy offline.
total_iterations (int) – Total number of iterations to train the policy.

Raises

UnsupportedTrainingException – Raised if this algorithm does not support the training method for the given parameter.

train_offline(replay_buffer: gym.core.Env, total_iterations: int)

Train the policy using only the replay buffer.

Parameters

replay_buffer (ReplayBuffer) – Replay buffer from which experiences are sampled to train the policy.
total_iterations (int) – Total number of iterations to train the policy.

Raises

UnsupportedTrainingException – Raised if the algorithm does not support offline training.

train_online(train_env: gym.core.Env, total_iterations: int)

Train the policy by interacting with the given environment.

Parameters

train_env (gym.Env) – Target environment to train the policy.
total_iterations (int) – Total number of iterations to train the policy.

Raises

UnsupportedTrainingException – Raised if the algorithm does not support online training.
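The relationship between train, train_online and train_offline can be pictured as a dispatch on the argument type. The sketch below is an assumption about how such routing could work, not the library's actual implementation. All class names in it (UnsupportedTrainingError, ToyEnv, ToyReplayBuffer, OnlineOnlyAlgorithm) are hypothetical stand-ins.

```python
class UnsupportedTrainingError(Exception):
    """Hypothetical stand-in for UnsupportedTrainingException."""


class ToyEnv:
    """Hypothetical marker class standing in for gym.Env."""


class ToyReplayBuffer:
    """Hypothetical marker class standing in for ReplayBuffer."""


class OnlineOnlyAlgorithm:
    """Hypothetical algorithm that supports online training only."""

    def train(self, env_or_buffer, total_iterations: int) -> str:
        # Route to the matching training method based on the argument type.
        if isinstance(env_or_buffer, ToyEnv):
            return self.train_online(env_or_buffer, total_iterations)
        if isinstance(env_or_buffer, ToyReplayBuffer):
            return self.train_offline(env_or_buffer, total_iterations)
        raise UnsupportedTrainingError(
            f"unsupported input: {type(env_or_buffer).__name__}"
        )

    def train_online(self, train_env, total_iterations: int) -> str:
        return f"online for {total_iterations} iterations"

    def train_offline(self, replay_buffer, total_iterations: int) -> str:
        # This toy algorithm cannot learn from a fixed buffer.
        raise UnsupportedTrainingError("offline training is not supported")


algorithm = OnlineOnlyAlgorithm()
online_result = algorithm.train(ToyEnv(), total_iterations=10)

try:
    algorithm.train(ToyReplayBuffer(), total_iterations=10)
    offline_supported = True
except UnsupportedTrainingError:
    offline_supported = False

print(online_result, offline_supported)
```

Callers that accept either an environment or a replay buffer may want to catch the exception in a similar way when they do not know in advance which training modes an algorithm supports.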

A2C

class nnabla_rl.algorithms.a2c.A2CConfig(gpu_id: int = -1, gamma: float = 0.99, n_steps: int = 5, learning_rate: float = 0.0007, entropy_coefficient: float = 0.01, value_coefficient: float = 0.5, decay: float = 0.99, epsilon: float = 1e-05, start_timesteps: int = 1, actor_num: int = 8, timelimit_as_terminal: bool = False, max_grad_norm: Optional[float] = 0.5, seed: int = -1)

Bases: nnabla_rl.algorithm.AlgorithmConfig

List of configurations for A2C algorithm

Parameters

gamma (float) – discount factor of rewards. Defaults to 0.99.
n_steps (int) – number of rollout steps. Defaults to 5.
learning_rate (float) – learning rate applied to all solvers. You can customize or override the learning rate for each solver by implementing your own SolverBuilder. Defaults to 0.0007.
entropy_coefficient (float) – scalar of entropy regularization term. Defaults to 0.01.
value_coefficient (float) – scalar of value loss. Defaults to 0.5.
decay (float) – decay parameter of Adam solver. Defaults to 0.99.
epsilon (float) – epsilon of Adam solver. Defaults to 0.00001.
start_timesteps (int) – the timestep when training starts. The algorithm will collect experiences from the environment by acting randomly until this timestep. Defaults to 1.
actor_num (int) – number of parallel actors. Defaults to 8.
timelimit_as_terminal (bool) – Treat the episode as done when the environment reaches its time limit. Defaults to False.
max_grad_norm (float) – threshold value for clipping gradient. Defaults to 0.5.
seed (int) – base seed of random number generator used by the actors. Defaults to -1.
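To make the roles of gamma and n_steps concrete, the sketch below computes the standard discounted n-step return over a single rollout, bootstrapping from a value estimate for the state after the last step. It is the textbook formulation, not necessarily how nnabla_rl implements A2C. The function name n_step_returns is hypothetical, and handling of terminal states inside the rollout is deliberately omitted.

```python
from typing import List


def n_step_returns(
    rewards: List[float], bootstrap_value: float, gamma: float = 0.99
) -> List[float]:
    """Compute discounted returns for one rollout, working backwards in time."""
    returns = [0.0] * len(rewards)
    running = bootstrap_value  # value estimate of the state after the rollout
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A rollout of length n_steps = 5 with a reward of 1.0 at every step.
# gamma = 0.5 is used here only so the numbers are easy to check by hand.
rollout_rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
returns = n_step_returns(rollout_rewards, bootstrap_value=0.0, gamma=0.5)
print(returns)
```

In an actor-critic setup these returns typically serve as regression targets for the value function, and the difference between each return and the predicted value is used as the advantage for the policy update.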

class nnabla_rl.algorithms.a2c.A2C(env_or_env_info, v_function_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.v_function.VFunction] = <nnabla_rl.algorithms.a2c.DefaultVFunctionBuilder object>, v_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.a2c.DefaultSolverBuilder object>, policy_builder: nnabla_rl.builders.model_builder.ModelBuilder[nnabla_rl.models.policy.StochasticPolicy] = <nnabla_rl.algorithms.a2c.DefaultPolicyBuilder object>, policy_solver_builder: nnabla_rl.builders.solver_builder.SolverBuilder = <nnabla_rl.algorithms.a2c.DefaultSolverBuilder object>, config=A2CConfig(gpu_id=-1, gamma=0.99, n_steps=5, learning_rate=0.0007, entropy_coefficient=0.01, value_coefficient=0.5, decay=0.99, epsilon=1e-05, start_timesteps=1, actor_num=8, timelimit_as_terminal=False, max_grad_norm=0.5, seed=-1))
