Environment explorers

All explorers are derived from nnabla_rl.environment_explorer.EnvironmentExplorer.

EnvironmentExplorer

class nnabla_rl.environment_explorer.EnvironmentExplorerConfig(warmup_random_steps: int = 0, reward_scalar: float = 1.0, timelimit_as_terminal: bool = True, initial_step_num: int = 0)[source]
class nnabla_rl.environment_explorer.EnvironmentExplorer(env_info: EnvironmentInfo, config: EnvironmentExplorerConfig = EnvironmentExplorerConfig(warmup_random_steps=0, reward_scalar=1.0, timelimit_as_terminal=True, initial_step_num=0))[source]

Base class for environment exploration methods.

abstract action(steps: int, state: ndarray, *, begin_of_episode: bool = False) Tuple[ndarray, Dict][source]

Compute the action for the given state at the given timestep.

Parameters:
  • steps (int) – timesteps since the beginning of exploration

  • state (np.ndarray) – current state for which to compute the action

  • begin_of_episode (bool) – Signals the beginning of an episode. Used for RNN state reset.

Returns:

action for the current state at the given timestep

Return type:

np.ndarray
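
To make the interface concrete, here is a minimal sketch of a custom explorer. The class name, the extra action_space argument, and the uniform random sampling are illustrative only and not part of the library.

import numpy as np

from nnabla_rl.environment_explorer import EnvironmentExplorer, EnvironmentExplorerConfig


class UniformRandomExplorer(EnvironmentExplorer):
    """Illustrative explorer that ignores the state and samples uniformly random actions."""

    def __init__(self, env_info, action_space, config=EnvironmentExplorerConfig()):
        super().__init__(env_info, config)
        # gym.Space to sample actions from (passed in explicitly for this sketch).
        self._sample_space = action_space

    def action(self, steps, state, *, begin_of_episode=False):
        # steps, state and begin_of_episode are ignored here; a real explorer would use them.
        action = self._sample_space.sample()
        return np.asarray(action), {}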

rollout(env: Env) List[Tuple[ndarray | Tuple[ndarray, ...], ndarray, float | ndarray, float, ndarray | Tuple[ndarray, ...], Dict[str, Any]]][source]

Roll out one episode in the given env.

Parameters:

env (gym.Env) – Environment

Returns:

List of experiences.

Each experience consists of (state, action, reward, terminal flag, next state, and extra info).

Return type:

List[Experience]
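
A usage sketch of rollout(), reusing the illustrative UniformRandomExplorer defined above. The EnvironmentInfo.from_env constructor and its import path are assumptions here.

import gym

from nnabla_rl.environments.environment_info import EnvironmentInfo  # import path assumed

env = gym.make("CartPole-v1")
env_info = EnvironmentInfo.from_env(env)
explorer = UniformRandomExplorer(env_info, env.action_space)

# Run one full episode and inspect the collected experiences.
experiences = explorer.rollout(env)
for state, action, reward, terminal_flag, next_state, info in experiences:
    print(action, reward)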

step(env: Env, n: int = 1, break_if_done: bool = False) List[Tuple[ndarray | Tuple[ndarray, ...], ndarray, float | ndarray, float, ndarray | Tuple[ndarray, ...], Dict[str, Any]]][source]

Step n timesteps in the given env.

Parameters:
  • env (gym.Env) – Environment

  • n (int) – Number of timesteps to act in the environment

  • break_if_done (bool) – If True, stop stepping before n timesteps have elapsed when the episode ends.

Returns:

List of experiences.

Each experience consists of (state, action, reward, terminal flag, next state, and extra info).

Return type:

List[Experience]
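
Continuing the same sketch, step() collects a fixed number of timesteps. The behavior attributed to break_if_done below is inferred from its name and default value, not confirmed by the reference text.

# Collect 100 consecutive timesteps from the environment.
experiences = explorer.step(env, n=100)

# With break_if_done=True, stop early once the episode ends before n steps (assumed behavior).
experiences = explorer.step(env, n=10, break_if_done=True)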

LinearDecayEpsilonGreedyExplorer

class nnabla_rl.environment_explorers.LinearDecayEpsilonGreedyExplorer(greedy_action_selector: ActionSelector, random_action_selector: ActionSelector, env_info: EnvironmentInfo, config: LinearDecayEpsilonGreedyExplorerConfig = LinearDecayEpsilonGreedyExplorerConfig(warmup_random_steps=0, reward_scalar=1.0, timelimit_as_terminal=True, initial_step_num=0, initial_epsilon=1.0, final_epsilon=0.05, max_explore_steps=1000000))[source]

Linear decay epsilon-greedy explorer.

Epsilon-greedy style explorer. Epsilon is linearly decayed until the max_explore_steps set in the config is reached.

Parameters:
  • greedy_action_selector (ActionSelector) – callable which computes the greedy action with respect to the current state.

  • random_action_selector (ActionSelector) – callable which computes a random action that can be executed in the environment.

  • env_info (EnvironmentInfo) – environment info

  • config (LinearDecayEpsilonGreedyExplorerConfig) – the config of this class.

action(step: int, state: ndarray, *, begin_of_episode: bool = False) Tuple[ndarray, Dict][source]

Compute the action for the given state at the given timestep.

Parameters:
  • step (int) – timesteps since the beginning of exploration

  • state (np.ndarray) – current state for which to compute the action

  • begin_of_episode (bool) – Signals the beginning of an episode. Used for RNN state reset.

Returns:

action for the current state at the given timestep

Return type:

np.ndarray
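
A construction sketch for this explorer. The selector signature (state in, (action, info dict) out, with a begin_of_episode keyword), the config import path, and EnvironmentInfo.from_env are assumptions; a real setup would compute the greedy action from a trained Q-function instead of the placeholder below.

import gym
import numpy as np

from nnabla_rl.environment_explorers import (  # config import path assumed
    LinearDecayEpsilonGreedyExplorer,
    LinearDecayEpsilonGreedyExplorerConfig,
)
from nnabla_rl.environments.environment_info import EnvironmentInfo  # import path assumed

env = gym.make("CartPole-v1")
env_info = EnvironmentInfo.from_env(env)


def greedy_action_selector(state, *, begin_of_episode=False):
    # Placeholder: a real selector would evaluate a trained Q-function here.
    return np.asarray([0]), {}


def random_action_selector(state, *, begin_of_episode=False):
    return np.asarray([env.action_space.sample()]), {}


config = LinearDecayEpsilonGreedyExplorerConfig(
    initial_epsilon=1.0,
    final_epsilon=0.05,
    max_explore_steps=100_000,
)
explorer = LinearDecayEpsilonGreedyExplorer(
    greedy_action_selector=greedy_action_selector,
    random_action_selector=random_action_selector,
    env_info=env_info,
    config=config,
)
experiences = explorer.step(env, n=10)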

GaussianExplorer

class nnabla_rl.environment_explorers.GaussianExplorer(policy_action_selector: ActionSelector, env_info: EnvironmentInfo, config: GaussianExplorerConfig = GaussianExplorerConfig(warmup_random_steps=0, reward_scalar=1.0, timelimit_as_terminal=True, initial_step_num=0, action_clip_low=-3.4028235e+38, action_clip_high=3.4028235e+38, sigma=1.0))[source]

Gaussian explorer.

Explores using the policy’s action with Gaussian noise added to it. The policy’s action must be continuous.

Parameters:
  • policy_action_selector (ActionSelector) – callable which computes the current policy’s action with respect to the current state.

  • env_info (EnvironmentInfo) – environment info

  • config (GaussianExplorerConfig) – the config of this class.

action(step: int, state: ndarray, *, begin_of_episode: bool = False) Tuple[ndarray, Dict][source]

Compute the action for the given state at the given timestep.

Parameters:
  • step (int) – timesteps since the beginning of exploration

  • state (np.ndarray) – current state for which to compute the action

  • begin_of_episode (bool) – Signals the beginning of an episode. Used for RNN state reset.

Returns:

action for the current state at the given timestep

Return type:

np.ndarray
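
A similar construction sketch for a continuous-control environment. The placeholder selector returns zeros in place of a real deterministic policy; the config import path and EnvironmentInfo.from_env are assumptions.

import gym
import numpy as np

from nnabla_rl.environment_explorers import GaussianExplorer, GaussianExplorerConfig  # import path assumed
from nnabla_rl.environments.environment_info import EnvironmentInfo  # import path assumed

env = gym.make("Pendulum-v1")
env_info = EnvironmentInfo.from_env(env)


def policy_action_selector(state, *, begin_of_episode=False):
    # Placeholder: a real selector would evaluate the trained policy here.
    return np.zeros(env.action_space.shape, dtype=np.float32), {}


config = GaussianExplorerConfig(
    sigma=0.1,
    action_clip_low=float(env.action_space.low[0]),
    action_clip_high=float(env.action_space.high[0]),
)
explorer = GaussianExplorer(
    policy_action_selector=policy_action_selector,
    env_info=env_info,
    config=config,
)
experiences = explorer.rollout(env)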

RawPolicyExplorer

class nnabla_rl.environment_explorers.RawPolicyExplorer(policy_action_selector: ActionSelector, env_info: EnvironmentInfo, config: RawPolicyExplorerConfig = RawPolicyExplorerConfig(warmup_random_steps=0, reward_scalar=1.0, timelimit_as_terminal=True, initial_step_num=0))[source]

Raw policy explorer.

Explores using the policy’s action as is, without any modification.

Parameters:
  • policy_action_selector (ActionSelector) – callable which computes the current policy’s action with respect to the current state.

  • env_info (EnvironmentInfo) – environment info

  • config (RawPolicyExplorerConfig) – the config of this class.

action(step: int, state: ndarray, *, begin_of_episode: bool = False) Tuple[ndarray, Dict][source]

Compute the action for the given state at the given timestep.

Parameters:
  • step (int) – timesteps since the beginning of exploration

  • state (np.ndarray) – current state for which to compute the action

  • begin_of_episode (bool) – Signals the beginning of an episode. Used for RNN state reset.

Returns:

action for the current state at the given timestep

Return type:

np.ndarray
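
Finally, a construction sketch for the raw policy explorer. The placeholder selector stands in for a trained stochastic policy; import paths are assumptions as in the earlier sketches.

import gym
import numpy as np

from nnabla_rl.environment_explorers import RawPolicyExplorer, RawPolicyExplorerConfig  # import path assumed
from nnabla_rl.environments.environment_info import EnvironmentInfo  # import path assumed

env = gym.make("Pendulum-v1")
env_info = EnvironmentInfo.from_env(env)


def policy_action_selector(state, *, begin_of_episode=False):
    # Placeholder: a real selector would sample from a trained stochastic policy.
    return np.zeros(env.action_space.shape, dtype=np.float32), {}


explorer = RawPolicyExplorer(
    policy_action_selector=policy_action_selector,
    env_info=env_info,
    config=RawPolicyExplorerConfig(warmup_random_steps=1000),
)
experiences = explorer.step(env, n=1)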