\(\varepsilon\)-greedy bandit policy.
Performs arm selection by using an \(\varepsilon\)-greedy bandit strategy. An arm is selected at each step. The best arm is selected with probability \(1 - \varepsilon\), while a random arm is selected (uniformly) with probability \(\varepsilon\).
Selection bias is a common problem when using bandits. This bias can be mitigated by using a burn-in phase: each arm is given the chance to be pulled during the first burn_in steps.
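For intuition, here is a minimal sketch of the selection rule. It is illustrative only: epsilon_greedy_pick, avg_rewards, and the tie-breaking behaviour are assumptions, not part of River's API.

import random

def epsilon_greedy_pick(arms, avg_rewards, epsilon, rng):
    # With probability epsilon, explore: pick an arm uniformly at random.
    if rng.random() < epsilon:
        return rng.choice(arms)
    # Otherwise exploit: pick the arm with the best average reward so far.
    return max(arms, key=lambda arm: avg_rewards.get(arm, 0.0))

rng = random.Random(42)
avg_rewards = {0: 0.1, 1: 0.7, 2: 0.4}
epsilon_greedy_pick([0, 1, 2], avg_rewards, epsilon=0.1, rng=rng)  # usually arm 1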
epsilon (float)
The probability of exploring.
decay – defaults to 0.0
The decay rate of epsilon.
reward_obj – defaults to None
The reward object used to measure the performance of each arm. This can be a metric, a statistic, or a distribution; a usage sketch follows this parameter list.
burn_in – defaults to 100
The number of steps to use for the burn-in phase. Each arm is given the chance to be pulled during the burn-in phase. This is useful to mitigate selection bias.
seed (int) – defaults to None
Random number generator seed for reproducibility.
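Putting the parameters together, a typical instantiation might look as follows (values are arbitrary; per the description above, any running statistic such as stats.Mean() can serve as the reward object):

from river import bandit, stats

policy = bandit.EpsilonGreedy(
    epsilon=0.1,              # explore 10% of the time
    decay=0.001,              # shrink epsilon as steps accumulate
    reward_obj=stats.Mean(),  # measure each arm by its average reward
    burn_in=25,               # steps reserved for the burn-in phase
    seed=42,                  # reproducible exploration
)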
current_epsilon
The value of epsilon after factoring in the decay rate.
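The exact decay schedule is an implementation detail; the sketch below assumes the common exponential form \(\varepsilon \cdot e^{-n \cdot \text{decay}}\), where \(n\) is the number of steps taken so far.

import math

def decayed_epsilon(epsilon0, decay, n_steps):
    # Assumed schedule: epsilon shrinks exponentially with the step count.
    return epsilon0 * math.exp(-n_steps * decay)

for n in (0, 100, 1000):
    print(n, round(decayed_epsilon(0.9, 0.01, n), 4))
# 0 0.9
# 100 0.3311
# 1000 0.0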
ranking
Return the list of arms in descending order of performance.
>>> import gym
>>> from river import bandit
>>> from river import stats

>>> env = gym.make(
...     'river_bandits/CandyCaneContest-v0'
... )
>>> _ = env.reset(seed=42)
>>> _ = env.action_space.seed(123)

>>> policy = bandit.EpsilonGreedy(epsilon=0.9, seed=101)

>>> metric = stats.Sum()
>>> while True:
...     action = next(policy.pull(range(env.action_space.n)))
...     observation, reward, terminated, truncated, info = env.step(action)
...     policy = policy.update(action, reward)
...     metric = metric.update(reward)
...     if terminated or truncated:
...         break

>>> metric
Sum: 775.
pull
This method is a generator that yields the arm(s) that should be pulled. During the burn-in phase, all the arms that have not been pulled enough are yielded. Once the burn-in phase is over, the policy is allowed to choose the arm(s) that should be pulled. If you only want to pull one arm at a time during the burn-in phase, simply call next(policy.pull(arm_ids)), as in the example above.
- arm_ids (List[Union[int, str]])
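To make the generator behaviour concrete, here is a sketch based on the example above (parameter values are arbitrary):

from river import bandit

policy = bandit.EpsilonGreedy(epsilon=0.1, burn_in=5, seed=42)
arm_ids = list(range(3))

# During burn-in, iterating the generator can yield several arms that still
# need pulls; pull each one and feed a reward back to the policy.
for arm in policy.pull(arm_ids):
    policy = policy.update(arm, 1.0)

# To pull a single arm per step instead, take only the first yielded arm.
arm = next(policy.pull(arm_ids))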
update
Update an arm's state.
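As the example above shows, update returns the policy itself, so the result should be reassigned. Continuing the sketch from pull (the reward value is arbitrary):

policy = policy.update(arm, 1.0)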