Upper Confidence Bound (UCB) bandit policy.
Due to the nature of this algorithm, it's recommended to scale the target so that it exhibits sub-gaussian properties. This can be done by using a `preprocessing.TargetStandardScaler`.
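As a minimal sketch of what such scaling might look like in practice, the snippet below standardizes rewards on the fly with running statistics before handing them to the policy. The `scale` helper is hypothetical and not part of River's API.

```python
from river import stats

# Running estimates of the reward's mean and variance.
mean = stats.Mean()
var = stats.Var()

def scale(reward):
    """Hypothetical helper: standardize a reward using running statistics."""
    mean.update(reward)
    var.update(reward)
    std = var.get() ** 0.5
    # Return 0 until some variance has been observed.
    return (reward - mean.get()) / std if std > 0 else 0.0

# Usage: policy.update(action, scale(raw_reward))
```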
Parameters

- `delta` (float)
  The confidence level. Setting this to 1 leads to what is called the UCB1 policy. A common form of the resulting index is sketched after this list.
- `reward_obj`
  The reward object used to measure the performance of each arm. This can be a metric, a statistic, or a distribution.
- `burn_in`
  The number of steps to use for the burn-in phase. Each arm is given the chance to be pulled during the burn-in phase. This is useful to mitigate selection bias.
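For intuition, a common formulation of the UCB index (an assumption about the exact form used here; the implementation may differ in constants) scores each arm $a$ after $t$ total steps as

$$
\mathrm{UCB}_a(t) = \bar{x}_a + \delta \sqrt{\frac{2 \ln t}{n_a}},
$$

where $\bar{x}_a$ is the arm's average reward and $n_a$ is the number of times it has been pulled. The arm with the highest index is pulled, and setting $\delta = 1$ recovers UCB1.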
Attributes

- `ranking`
  Return the list of arms in descending order of performance.
Example

```python
import gym
from river import bandit
from river import stats

env = gym.make('river_bandits/CandyCaneContest-v0')
_ = env.reset(seed=42)
_ = env.action_space.seed(123)

policy = bandit.UCB(delta=100)

metric = stats.Sum()
while True:
    action = next(policy.pull(range(env.action_space.n)))
    observation, reward, terminated, truncated, info = env.step(action)
    policy = policy.update(action, reward)
    metric = metric.update(reward)
    if terminated or truncated:
        break

metric
```
Methods

- `pull`
  This method is a generator that yields the arm(s) that should be pulled. During the burn-in phase, all the arms that have not been pulled enough times are yielded. Once the burn-in phase is over, the policy is allowed to choose the arm(s) that should be pulled. If you only want to pull one arm at a time during the burn-in phase, simply call `next(policy.pull(arm_ids))`. A short usage sketch follows this list.
  Parameters: `arm_ids` (list[ArmID])
- `update`
  Update an arm's state.
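As a brief usage sketch (the arm IDs, rewards, and `burn_in` value are made up for illustration), here is how `pull` and `update` interact during and after the burn-in phase:

```python
from river import bandit

policy = bandit.UCB(delta=1, burn_in=2)
arm_ids = [0, 1, 2]

for step in range(10):
    # During burn-in, pull yields arms that have not been pulled enough;
    # afterwards, it yields the arm with the highest upper confidence bound.
    arm = next(policy.pull(arm_ids))
    reward = 1.0 if arm == 2 else 0.0  # dummy deterministic rewards
    policy = policy.update(arm, reward)

# Arm 2 always pays out, so it should be ranked first.
print(policy.ranking)
```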