evaluate_offline¶
Evaluate a policy on historical logs using replay.
This is a high-level utility function for evaluating a policy with the replay methodology, which is an off-policy evaluation method: it does not require an environment and is instead entirely data-driven.
At each step, an arm is pulled from the provided policy. If the arm is the same as the arm that was pulled in the historical data, the reward is used to update the policy. If the arm is different, the reward is ignored. This is the off-policy aspect of the evaluation.
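For intuition, here is a minimal sketch of that replay loop. It is illustrative only, not the function's actual implementation, and it assumes a policy object exposing pull(arms) and update(arm, reward) methods together with logs of (arm, reward) pairs.

# Illustrative replay loop (not the library's actual code): assumes a policy
# with pull(arms) and update(arm, reward) methods, and logs of (arm, reward) pairs.
def replay(policy, arms, logs):
    total_reward = 0.0
    n_samples_used = 0
    for logged_arm, reward in logs:
        chosen_arm = policy.pull(arms)       # what the evaluated policy would do
        if chosen_arm == logged_arm:         # arms match: the logged reward is usable
            policy.update(logged_arm, reward)
            total_reward += reward
            n_samples_used += 1
        # arms differ: the sample is discarded
    return total_reward, n_samples_used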
Parameters¶
- policy
  Type → bandit.base.Policy
  The policy to evaluate.
- history
  Type → History
  The history of the bandit problem. This is a generator that yields tuples of the form (context, arm, probability, reward). The probability is optional: it is the probability the logging policy had of picking the arm. If provided, it is used to unbias the final score via inverse propensity scoring (see the short illustration after this list).
- reward_stat
  Type → stats.base.Univariate
  Default → None
  The reward statistic to use. Defaults to stats.Sum.
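To make the inverse propensity idea concrete, here is a small standalone illustration with made-up numbers: a reward logged under a low propensity is up-weighted by 1 / probability, so the replayed score is not biased toward the arms the logging policy favored.

# Standalone illustration of inverse propensity scoring (made-up numbers):
# each logged reward is weighted by 1 / probability of the arm that produced it.
logged = [
    ('A', 0.50, 1.0),   # (arm, probability under the logging policy, reward)
    ('B', 0.10, 1.0),   # a rarely pulled arm contributes with a larger weight
]
ips_score = sum(reward / probability for _, probability, reward in logged)
print(ips_score)  # 1.0 / 0.5 + 1.0 / 0.1 = 12.0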
Examples¶
import random
from river import bandit

rng = random.Random(42)
arms = ['A', 'B', 'C']

# Simulated click logs, following the (context, arm, probability, reward) format.
clicks = [
    (
        arms,                  # context (here simply the list of arms)
        rng.choice(arms),      # arm pulled by the logging policy
        (p := rng.random()),   # probability of the pulled arm (here just a random number)
        p > 0.9                # reward: a click, which happens roughly 10% of the time
    )
    for _ in range(1000)
]

total_reward, n_samples_used = bandit.evaluate_offline(
    policy=bandit.EpsilonGreedy(0.1, seed=42),
    history=clicks,
)

total_reward
Sum: 33.626211

n_samples_used
323
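The reward_stat parameter accepts any univariate statistic. For instance, passing stats.Mean() reports the average reward over the samples that were actually used rather than their sum; the snippet below reuses the clicks logs from the example above.

from river import stats

avg_reward, n_samples_used = bandit.evaluate_offline(
    policy=bandit.EpsilonGreedy(0.1, seed=42),
    history=clicks,
    reward_stat=stats.Mean(),   # track the mean reward instead of the sum
)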