# Exp3

Exp3 bandit policy.

This policy works by maintaining a weight for each arm. These weights are used to randomly decide which arm to pull, and they are increased or decreased depending on the reward. An egalitarianism factor $$\gamma \in [0, 1]$$ tunes the desire to pick an arm uniformly at random: if $$\gamma = 1$$, the arms are picked uniformly at random.
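For reference, in the standard formulation of Exp3 (Auer et al., 2002) each arm $$i$$ keeps a weight $$w_i$$, initialised to 1, and is pulled with probability

$$p_i = (1 - \gamma) \frac{w_i}{\sum_{j=1}^{K} w_j} + \frac{\gamma}{K}$$

where $$K$$ is the number of arms. After a reward $$x \in [0, 1]$$ is observed for the pulled arm, that arm's weight is multiplied by $$\exp\big(\gamma x / (p_i K)\big)$$ while the other weights are left unchanged. The implementation here may differ in minor details, for instance in how the reward object summarises rewards before they enter the update.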

## Parameters

• gamma

Type → float

The egalitarianism factor. Setting this to 0 removes the uniform exploration component, so arms are picked purely in proportion to their weights. See the construction sketch after this list.

• reward_obj

Default → None

The reward object used to measure the performance of each arm. This can be a metric, a statistic, or a distribution.

• reward_scaler

Default → None

A reward scaler used to scale the rewards before they are fed to the reward object. This can be useful, for instance, to scale the rewards to a (0, 1) range.

• burn_in

Default → 0

The number of steps to use for the burn-in phase. Each arm is given the chance to be pulled during the burn-in phase. This is useful to mitigate selection bias.

• seed

Type → int | None

Default → None

Random number generator seed for reproducibility.
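
The sketch below shows one way these parameters might be combined; the values, and the choice of stats.Mean as the reward object, are illustrative rather than defaults:

```python
from river import bandit, stats

# Illustrative configuration: keep a little uniform exploration, summarise each
# arm's rewards by their mean, and favour under-pulled arms for the first 25 steps.
policy = bandit.Exp3(
    gamma=0.1,
    reward_obj=stats.Mean(),
    burn_in=25,
    seed=42,
)
```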

## Attributes

• ranking

Return the list of arms in descending order of performance.
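
A minimal usage sketch, assuming a policy that has already been updated with a few rewards (the actual order depends on those rewards):

```python
best_arm, *rest = policy.ranking  # arms sorted from best-performing to worst
```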

## Examples

```python
import gym
from river import bandit
from river import proba
from river import stats

env = gym.make(
    'river_bandits/CandyCaneContest-v0'
)
_ = env.reset(seed=42)
_ = env.action_space.seed(123)

policy = bandit.Exp3(gamma=0.5, seed=42)

metric = stats.Sum()
while True:
    action = policy.pull(range(env.action_space.n))
    observation, reward, terminated, truncated, info = env.step(action)
    policy = policy.update(action, reward)
    metric = metric.update(reward)
    if terminated or truncated:
        break

metric
```

```
Sum: 799.
```


## Methods

pull

Pull arm(s).

This method is a generator that yields the arm(s) that should be pulled. During the burn-in phase, all the arms that have not been pulled enough times are yielded. Once the burn-in phase is over, the policy is allowed to choose the arm(s) that should be pulled. If you only want to pull one arm at a time during the burn-in phase, simply call `next(policy.pull(arms))`.

Parameters

• arm_ids: 'list[ArmID]'

Returns

ArmID: A single arm.
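
A minimal sketch with three integer arm identifiers, used the same way as in the Examples section above (if pull behaves as a generator in your version, wrap the call in `next(...)` as noted above):

```python
policy = bandit.Exp3(gamma=0.5, burn_in=3, seed=42)
arm = policy.pull([0, 1, 2])  # under-pulled arms are favoured while the burn-in lasts
```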

update

Update an arm's state.

Parameters

• arm_id
• reward_args
• reward_kwargs
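
Continuing the sketch above, the reward observed for the pulled arm is passed back positionally, and the policy is reassigned as in the Examples section (the reward value here is illustrative):

```python
policy = policy.update(arm, 1.0)  # 1.0 is the reward observed for `arm`
```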