RandomPolicy

Random bandit policy.

This policy simply pulls a random arm at each time step. It is useful as a baseline.
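To see why random selection is a useful baseline, consider the minimal sketch below. It is not River's implementation: the `pull_random` helper and the `win_probs` values are hypothetical and exist only for illustration, and uniform selection is an assumption. A policy that ignores rewards should earn roughly the average payoff across all arms, which gives a floor that any learning policy ought to beat.

```python
import random


def pull_random(arm_ids, rng):
    """Pick one arm uniformly at random, ignoring past rewards (hypothetical helper)."""
    return rng.choice(list(arm_ids))


rng = random.Random(42)

# Hypothetical success probabilities for three arms (illustration only).
win_probs = [0.2, 0.5, 0.8]

n_steps = 10_000
total_reward = 0
for _ in range(n_steps):
    arm = pull_random(range(len(win_probs)), rng)
    # Simulate a Bernoulli reward for the chosen arm.
    reward = 1 if rng.random() < win_probs[arm] else 0
    total_reward += reward

# Should land close to the mean of win_probs, i.e. about 0.5.
average_reward = total_reward / n_steps
print(average_reward)
```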

Parameters

reward_obj

Default → None

The reward object that is used to update the posterior distribution.

burn_in

Default → 0

Number of initial observations per arm before using the posterior distribution.

seed

Type → int | None

Default → None

Random number generator seed for reproducibility.
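The effect of a fixed seed can be illustrated with the standard library alone. The sketch below assumes the policy draws its arms from a seeded pseudo-random generator. The `sample_arms` helper is hypothetical and not part of River.

```python
import random


def sample_arms(seed, n_arms=5, n_pulls=10):
    """Draw a sequence of arm indices from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.randrange(n_arms) for _ in range(n_pulls)]


# Two runs with the same seed produce the same sequence of arms.
first_run = sample_arms(seed=123)
second_run = sample_arms(seed=123)
print(first_run == second_run)
```

Leaving the seed as None would typically make each run differ, which is fine for production use but inconvenient for tests and documentation examples.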

Attributes

ranking

Return the list of arms in descending order of performance.
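As a rough illustration of what a descending ranking looks like, the sketch below orders arms by their average observed reward. This is an assumption made for the example: how performance is actually measured depends on the reward object, and the `rewards_by_arm` data is made up.

```python
from statistics import fmean

# Hypothetical rewards observed for each arm (illustration only).
rewards_by_arm = {
    "a": [1, 0, 0],
    "b": [1, 1, 0],
    "c": [0, 0, 0],
}

# Sort arm identifiers from best to worst average reward.
ranking = sorted(
    rewards_by_arm,
    key=lambda arm_id: fmean(rewards_by_arm[arm_id]),
    reverse=True,
)
print(ranking)  # ['b', 'a', 'c']
```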

Examples

import gymnasium as gym
from river import bandit
from river import stats

env = gym.make(
    'river_bandits/CandyCaneContest-v0'
)
_ = env.reset(seed=42)
_ = env.action_space.seed(123)

policy = bandit.RandomPolicy(seed=123)

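# Accumulate the total reward collected over one episode.
metric = stats.Sum()
while True:
    # Choose an arm at random among all the available actions.
    action = policy.pull(range(env.action_space.n))
    observation, reward, terminated, truncated, info = env.step(action)
    # Record the observed reward for the pulled arm.
    policy.update(action, reward)
    metric.update(reward)
    if terminated or truncated:
        break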

metric

Sum: 755.

Methods

pull

Pull arm(s).

This method returns the arm that should be pulled. During the burn-in phase, an arm that has not yet been pulled enough times is returned. Once the burn-in phase is over, the policy is free to choose the arm to pull.

Parameters

arm_ids — 'list[ArmID]'

Returns

ArmID: A single arm.
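The interplay between the burn-in phase and random selection can be sketched as follows, assuming pull returns a single arm as stated under Returns. The `pull_with_burn_in` function and the `counts` bookkeeping are hypothetical stand-ins, not River's code. In this sketch the counts are incremented by the caller, whereas a real policy would likely do this when it is updated.

```python
import random
from collections import Counter


def pull_with_burn_in(arm_ids, counts, burn_in, rng):
    """Return an under-sampled arm during burn-in, else a random arm."""
    for arm_id in arm_ids:
        if counts[arm_id] < burn_in:
            return arm_id
    # Burn-in is over: fall back to a uniformly random choice.
    return rng.choice(list(arm_ids))


rng = random.Random(0)
counts = Counter()
arms = [0, 1, 2]

history = []
for _ in range(8):
    arm = pull_with_burn_in(arms, counts, burn_in=2, rng=rng)
    counts[arm] += 1  # stands in for the bookkeeping done on update
    history.append(arm)

# The first six pulls cover each arm twice, in order.
print(history[:6])  # [0, 0, 1, 1, 2, 2]
```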

update

Update an arm's state.

Parameters

arm_id
reward_args
reward_kwargs
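One plausible reading of these parameters is that the extra positional and keyword arguments are forwarded to the reward object attached to the arm. The sketch below illustrates that idea under that assumption. `RunningMean` and `ArmTracker` are hypothetical classes written for this example and are not River's API.

```python
from collections import defaultdict


class RunningMean:
    """Hypothetical reward object: a weighted running mean."""

    def __init__(self):
        self.n = 0.0
        self.mean = 0.0

    def update(self, x, w=1.0):
        self.n += w
        self.mean += w * (x - self.mean) / self.n


class ArmTracker:
    """Hypothetical per-arm state, keyed by arm identifier."""

    def __init__(self):
        self.rewards = defaultdict(RunningMean)
        self.counts = defaultdict(int)

    def update(self, arm_id, *reward_args, **reward_kwargs):
        # Forward everything after arm_id to the arm's reward object.
        self.rewards[arm_id].update(*reward_args, **reward_kwargs)
        self.counts[arm_id] += 1


tracker = ArmTracker()
tracker.update("a", 1.0)
tracker.update("a", 0.0)
tracker.update("b", 1.0, w=2.0)  # keyword argument reaches RunningMean.update

print(tracker.counts["a"], tracker.rewards["a"].mean)  # 2 0.5
```

Forwarding the arguments this way keeps the policy agnostic about what a reward looks like: a plain number, a weighted observation, or anything else the reward object accepts.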