UCB

Upper Confidence Bound (UCB) bandit policy.

The confidence bounds used by UCB assume that rewards are sub-Gaussian, so it is recommended to scale the target accordingly. This can be done by passing a preprocessing.TargetStandardScaler instance to the reward_scaler argument.
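To make the scaling recommendation concrete, here is a minimal pure-Python sketch of online reward standardization using Welford's algorithm. The RunningStandardizer class is hypothetical and is not the library's TargetStandardScaler; it only illustrates the idea of centering and rescaling a reward stream as it arrives.

```python
import math


class RunningStandardizer:
    """Hypothetical helper: standardizes a stream of rewards online."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Welford's update: numerically stable running mean and variance.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def transform(self, x: float) -> float:
        # With fewer than two observations the variance is undefined.
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return (x - self.mean) / std if std > 0 else 0.0


scaler = RunningStandardizer()
for reward in [10.0, 12.0, 14.0]:
    scaler.update(reward)

scaled = scaler.transform(14.0)  # one sample standard deviation above the mean
```

A scaled reward near 0 is typical for the stream seen so far, while large absolute values flag unusually good or bad rewards.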

Parameters

  • delta

    Type: float

    The confidence level. Setting this to 1 leads to what is called the UCB1 policy.

  • reward_obj

    Default: None

    The reward object used to measure the performance of each arm. This can be a metric, a statistic, or a distribution.

  • reward_scaler

    Default: None

    A reward scaler used to scale the rewards before they are fed to the reward object. This can be useful to scale the rewards to a (0, 1) range for instance.

  • burn_in

    Default: 0

    The number of steps to use for the burn-in phase. Each arm is given the chance to be pulled during the burn-in phase. This is useful to mitigate selection bias.

  • seed

    Type: int | None

    Default: None

    Random number generator seed for reproducibility.
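The role of delta can be illustrated with a textbook UCB1-style index: the observed mean reward of an arm plus an exploration bonus that shrinks as the arm is pulled more often. The sketch below assumes a bonus of the form delta * sqrt(2 ln t / n); the exact expression used by the library may differ, and ucb_index is a hypothetical helper, not part of its API.

```python
import math


def ucb_index(mean_reward: float, n_pulls: int, t: int, delta: float = 1.0) -> float:
    """Hypothetical UCB index: mean reward plus a delta-weighted exploration bonus."""
    if n_pulls == 0:
        # An arm that was never pulled gets an infinite bonus, so it is tried first.
        return math.inf
    return mean_reward + delta * math.sqrt(2 * math.log(t) / n_pulls)


means = {"a": 0.5, "b": 0.4}
counts = {"a": 90, "b": 10}
t = sum(counts.values())  # total number of pulls so far

indices = {arm: ucb_index(means[arm], counts[arm], t) for arm in means}
best = max(indices, key=indices.get)
```

Here arm "b" has the lower mean but far fewer pulls, so its bonus is larger and it ends up with the higher index. Under this assumed form, a larger delta widens the bonus and pushes the policy toward exploration.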

Attributes

  • ranking

    Return the list of arms in descending order of performance.
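As a rough illustration of what such a ranking looks like, the sketch below sorts arms by a per-arm score in descending order. It assumes performance is summarized by a mean reward; the dictionary and arm names are made up for the example.

```python
# Hypothetical per-arm mean rewards, keyed by arm identifier.
mean_rewards = {"arm_0": 0.12, "arm_1": 0.47, "arm_2": 0.30}

# Best-performing arm first.
ranking = sorted(mean_rewards, key=mean_rewards.get, reverse=True)
```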

Examples

import gym
from river import bandit
from river import preprocessing
from river import stats

# Create the bandit environment and seed it for reproducibility.
env = gym.make(
    'river_bandits/CandyCaneContest-v0'
)
_ = env.reset(seed=42)
_ = env.action_space.seed(123)

# UCB policy with standardized rewards.
policy = bandit.UCB(
    delta=100,
    reward_scaler=preprocessing.TargetStandardScaler(None),
    seed=42
)

# Accumulate the total reward collected over the episode.
metric = stats.Sum()
while True:
    arm = policy.pull(range(env.action_space.n))
    observation, reward, terminated, truncated, info = env.step(arm)
    policy.update(arm, reward)
    metric.update(reward)
    if terminated or truncated:
        break

metric
# Sum: 744.

Methods

pull

Pull arm(s).

This method is a generator that yields the arm(s) that should be pulled. During the burn-in phase, all the arms that have not been pulled enough times are yielded. Once the burn-in phase is over, the policy is allowed to choose the arm(s) that should be pulled. If you only want to pull one arm at a time during the burn-in phase, simply call next(policy.pull(arms)).

Parameters

  • arm_ids

    Type: list[ArmID]

Returns

ArmID: A single arm.
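The burn-in behaviour described for pull can be sketched as a small generator. This is a simplified illustration under stated assumptions, not the library's implementation: the pull function, the counts dictionary, and the most_pulled decision rule are all hypothetical stand-ins.

```python
import random


def pull(arm_ids, counts, burn_in, choose, rng):
    """Hypothetical generator: yield under-explored arms first, else the policy's pick."""
    under_explored = [arm for arm in arm_ids if counts.get(arm, 0) < burn_in]
    if under_explored:
        rng.shuffle(under_explored)  # avoid always favouring the same arm order
        yield from under_explored
    else:
        yield choose(arm_ids)


def most_pulled(arm_ids):
    """Stand-in for the policy's own decision rule (an assumption for this sketch)."""
    return max(arm_ids, key=lambda arm: counts.get(arm, 0))


counts = {0: 2, 1: 0, 2: 1}
rng = random.Random(42)

# Burn-in of 2: arms 1 and 2 have been pulled fewer than 2 times.
burn_in_arms = sorted(pull([0, 1, 2], counts, burn_in=2, choose=most_pulled, rng=rng))

# No burn-in: the policy's rule decides, and next() takes a single arm.
chosen = next(pull([0, 1, 2], counts, burn_in=0, choose=most_pulled, rng=rng))
```

Wrapping the call in next() keeps the one-arm-per-step loop shape even while several arms are still in their burn-in phase.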

update

Update an arm's state.

Parameters

  • arm_id
  • reward_args
  • reward_kwargs