\(\varepsilon\)-greedy bandit policy.
Performs arm selection with an \(\varepsilon\)-greedy bandit strategy. An arm is selected at each step: the best arm is selected with probability \(1 - \varepsilon\), and a random arm is selected with probability \(\varepsilon\).
Selection bias is a common problem when using bandits. This bias can be mitigated by using a burn-in phase: each model is given the chance to learn during the first `burn_in` steps.
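The core selection rule is short enough to sketch in plain Python. The helper below is illustrative only — its name and signature are invented for this explanation and it is not River's implementation:

```python
import random

def epsilon_greedy_select(arm_ids, mean_rewards, epsilon, rng):
    """Pick a uniformly random arm with probability epsilon,
    otherwise the arm with the best observed mean reward."""
    if rng.random() < epsilon:
        return rng.choice(list(arm_ids))  # explore
    return max(arm_ids, key=lambda arm: mean_rewards.get(arm, 0.0))  # exploit

rng = random.Random(42)
arm = epsilon_greedy_select(['a', 'b'], {'a': 0.3, 'b': 0.7}, epsilon=0.1, rng=rng)
```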
Parameters

epsilon
Type → float
The probability of exploring.

decay
The decay rate of epsilon.

reward_obj
The reward object used to measure the performance of each arm. This can be a metric, a statistic, or a distribution.

burn_in
The number of steps to use for the burn-in phase. Each arm is given the chance to be pulled during the burn-in phase. This is useful to mitigate selection bias.

seed
Type → int | None
Random number generator seed for reproducibility.
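As a sketch of how these parameters fit together, a minimal instantiation might look as follows — the chosen values are illustrative, not recommended settings:

```python
from river import bandit, stats

policy = bandit.EpsilonGreedy(
    epsilon=0.1,              # explore 10% of the time
    decay=0.01,               # gradually shift from exploration to exploitation
    reward_obj=stats.Mean(),  # rank arms by their average reward
    burn_in=50,               # give every arm a chance during the first steps
    seed=42,                  # reproducible randomness
)
```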
Attributes

current_epsilon
The value of epsilon after factoring in the decay rate.

ranking
Return the list of arms in descending order of performance.
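The interaction between `epsilon` and `decay` is easiest to see as a formula. Exponential decay is a common implementation choice and is assumed here as a mental model, not quoted from the source; under that assumption the effective exploration probability after \(t\) steps would be

\[
\varepsilon_t = \varepsilon \cdot e^{-t \cdot \text{decay}},
\]

so `decay = 0` keeps a constant exploration rate \(\varepsilon\), while larger values shift the policy from exploration toward exploitation over time.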
Examples

```python
import gym
from river import bandit
from river import stats

# Create a bandit environment with reproducible seeding.
env = gym.make('river_bandits/CandyCaneContest-v0')
_ = env.reset(seed=42)
_ = env.action_space.seed(123)

policy = bandit.EpsilonGreedy(epsilon=0.9, seed=101)

# Track the total reward collected over the episode.
metric = stats.Sum()
while True:
    # Pick one arm among the available actions.
    action = next(policy.pull(range(env.action_space.n)))
    observation, reward, terminated, truncated, info = env.step(action)
    # Feed the observed reward back into the policy and the metric.
    policy = policy.update(action, reward)
    metric = metric.update(reward)
    if terminated or truncated:
        break

metric
```
Methods

pull
This method is a generator that yields the arm(s) that should be pulled. During the burn-in phase, all the arms that have not been pulled enough times are yielded. Once the burn-in phase is over, the policy is allowed to choose the arm(s) that should be pulled. If you only want to pull one arm at a time during the burn-in phase, simply call `next(policy.pull(arm_ids))` (see the sketch after the parameter list below).
- arm_ids — `list[ArmID]`
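To make the burn-in behaviour concrete, here is a short sketch; the three arm identifiers are made up for illustration:

```python
from river import bandit

policy = bandit.EpsilonGreedy(epsilon=0.1, burn_in=3, seed=7)
arm_ids = ['a', 'b', 'c']  # hypothetical arm identifiers

# During burn-in, every arm that has not been pulled enough times is
# yielded, so exhausting the generator collects all of them at once.
to_pull = list(policy.pull(arm_ids))

# To pull just one arm at a time, take only the first yielded arm.
arm = next(policy.pull(arm_ids))
```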
update
Update an arm's state.
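Putting `pull` and `update` together, here is a small self-contained simulation. The Bernoulli reward probabilities are invented for the example, and the update pattern mirrors the documented example above:

```python
import random
from river import bandit

rng = random.Random(0)
true_probs = {0: 0.2, 1: 0.5, 2: 0.8}  # hypothetical Bernoulli arms

policy = bandit.EpsilonGreedy(epsilon=0.1, burn_in=30, seed=42)

for _ in range(1000):
    # Pick an arm, sample a 0/1 reward, and feed it back into the policy.
    arm = next(policy.pull(list(true_probs)))
    reward = 1.0 if rng.random() < true_probs[arm] else 0.0
    policy = policy.update(arm, reward)

# After enough steps, the best arm should rank first.
print(policy.ranking)
```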