EpsilonGreedy

\(\varepsilon\)-greedy bandit policy.

Performs arm selection using an \(\varepsilon\)-greedy bandit strategy. At each step, a random arm is explored with probability \(\varepsilon\), while the best arm found so far is exploited with probability \(1 - \varepsilon\).
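
The selection rule itself is short. The following standalone sketch illustrates the core idea; it is not river's internal implementation, and arm_means (a mapping from each arm to its estimated reward) is a stand-in for the policy's reward bookkeeping:

>>> import random

>>> def epsilon_greedy(arm_means, epsilon, rng):
...     """Explore a random arm w.p. epsilon, else exploit the best one."""
...     if rng.random() < epsilon:
...         return rng.choice(list(arm_means))
...     return max(arm_means, key=arm_means.get)

>>> rng = random.Random(42)
>>> epsilon_greedy({'a': 0.2, 'b': 0.5}, epsilon=0.1, rng=rng)
'b'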

Selection bias is a common problem when using bandits. It can be mitigated with a burn-in phase: each arm is given the chance to be pulled during the first burn_in steps.
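
Conceptually, the burn-in check takes precedence over the \(\varepsilon\)-greedy rule. Here is a minimal sketch of that ordering, again illustrative rather than river's actual code, with pull_counts and choose standing in for the policy's internals:

>>> def pull_one(arm_ids, pull_counts, burn_in, choose):
...     """Return an under-pulled arm if any, else defer to the policy."""
...     for arm in arm_ids:
...         if pull_counts.get(arm, 0) < burn_in:
...             return arm
...     return choose()

>>> pull_one([0, 1], {0: 3, 1: 1}, burn_in=2, choose=lambda: 0)
1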

Parameters

  • epsilon (float)

    The probability of exploring.

  • decay (float) – defaults to 0.0

    The decay rate of epsilon.

  • reward_obj – defaults to None

    The reward object used to measure the performance of each arm. This can be a metric, a statistic, or a distribution.

  • burn_in – defaults to 0

    The number of steps to use for the burn-in phase. Each arm is given the chance to be pulled during the burn-in phase. This is useful to mitigate selection bias.

  • seed (int) – defaults to None

    Random number generator seed for reproducibility.

Attributes

  • current_epsilon

    The value of epsilon after factoring in the decay rate. A worked example follows this list.

  • ranking

    The list of arms, in descending order of performance.
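
A common choice of decay schedule is exponential in the number of update steps, i.e. \(\varepsilon_t = \varepsilon \cdot e^{-t \cdot \text{decay}}\). This schedule is an assumption for illustration; check the source for the exact formula river applies. With that schedule:

>>> import math

>>> epsilon, decay, t = 0.9, 0.01, 100
>>> round(epsilon * math.exp(-t * decay), 3)  # epsilon after 100 updates
0.331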

Examples

>>> import gym
>>> from river import bandit
>>> from river import stats

>>> env = gym.make(
...     'river_bandits/CandyCaneContest-v0'
... )
>>> _ = env.reset(seed=42)
>>> _ = env.action_space.seed(123)

>>> policy = bandit.EpsilonGreedy(epsilon=0.9, seed=101)

>>> metric = stats.Sum()
>>> while True:
...     action = next(policy.pull(range(env.action_space.n)))
...     observation, reward, terminated, truncated, info = env.step(action)
...     policy = policy.update(action, reward)
...     metric = metric.update(reward)
...     if terminated or truncated:
...         break

>>> metric
Sum: 775.

Methods

pull

Pull arm(s).

This method is a generator that yields the arm(s) that should be pulled. During the burn-in phase, every arm that has not yet been pulled enough times is yielded. Once the burn-in phase is over, the policy chooses the arm(s) that should be pulled. If you only want to pull one arm at a time during the burn-in phase, simply call next(policy.pull(arm_ids)).

Parameters

  • arm_ids (List[Union[int, str]])
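
For instance, the following draws a single arm; which particular arm comes back depends on the burn-in bookkeeping and the seed:

>>> from river import bandit

>>> policy = bandit.EpsilonGreedy(epsilon=0.1, burn_in=2, seed=42)
>>> arm = next(policy.pull([0, 1, 2]))
>>> arm in [0, 1, 2]
True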

update

Update an arm's state.

Parameters

  • arm_id
  • reward_args
  • reward_kwargs
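
Mirroring the loop in the example above, a single observed reward is fed back like so; the default reward object is assumed to accept one numeric value, and the policy is reassigned because update returns it:

>>> from river import bandit

>>> policy = bandit.EpsilonGreedy(epsilon=0.1, seed=42)
>>> policy = policy.update(0, 1.0)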
