evaluate_offline¶
Evaluate a policy on historical logs using replay.
This is a high-level utility function for evaluating a policy with the replay methodology, which is an off-policy evaluation method: it does not require an environment and is instead entirely data-driven.
At each step, an arm is pulled from the provided policy. If the arm is the same as the arm that was pulled in the historical data, the reward is used to update the policy. If the arm is different, the reward is ignored. This is the off-policy aspect of the evaluation.
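For intuition, here is a minimal sketch of that replay loop. It is illustrative only, not the function's actual implementation, and it assumes a policy object exposing pull(arms) and update(arm, reward) methods together with logs of (arm, reward) pairs.

# Illustrative replay loop (not the library's actual code): assumes a policy
# with pull(arms) and update(arm, reward) methods, and logs of (arm, reward) pairs.
def replay(policy, arms, logs):
    total_reward = 0.0
    n_samples_used = 0
    for logged_arm, reward in logs:
        chosen_arm = policy.pull(arms)       # what the evaluated policy would do
        if chosen_arm == logged_arm:         # arms match: the logged reward is usable
            policy.update(logged_arm, reward)
            total_reward += reward
            n_samples_used += 1
        # arms differ: the sample is discarded
    return total_reward, n_samples_used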
Parameters¶
- policy
  Type → bandit.base.Policy
  The policy to evaluate.
- history
  Type → History
  The history of the bandit problem. This is a generator that yields tuples of the form (context, arm, probability, reward). The probability is optional: it is the probability the logging policy had of picking the arm. If provided, it is used to unbias the final score via inverse propensity scoring (see the short illustration after this list).
- reward_stat
  Type → stats.base.Univariate
  Default → None
  The reward statistic to use. Defaults to stats.Sum.
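To make the inverse propensity idea concrete, here is a small standalone illustration with made-up numbers: a reward logged under a low propensity is up-weighted by 1 / probability, so the replayed score is not biased toward the arms the logging policy favored.

# Standalone illustration of inverse propensity scoring (made-up numbers):
# each logged reward is weighted by 1 / probability of the arm that produced it.
logged = [
    ('A', 0.50, 1.0),   # (arm, probability under the logging policy, reward)
    ('B', 0.10, 1.0),   # a rarely pulled arm contributes with a larger weight
]
ips_score = sum(reward / probability for _, probability, reward in logged)
print(ips_score)  # 1.0 / 0.5 + 1.0 / 0.1 = 12.0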
Examples¶
import random
from river import bandit

rng = random.Random(42)
arms = ['A', 'B', 'C']

# Simulated click logs, following the (context, arm, probability, reward) format.
clicks = [
    (
        arms,                  # context (here simply the list of arms)
        rng.choice(arms),      # arm pulled by the logging policy
        (p := rng.random()),   # probability of the pulled arm (here just a random number)
        p > 0.9                # reward: a click, which happens roughly 10% of the time
    )
    for _ in range(1000)
]

total_reward, n_samples_used = bandit.evaluate_offline(
    policy=bandit.EpsilonGreedy(0.1, seed=42),
    history=clicks,
)

total_reward
Sum: 33.626211

n_samples_used
323
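The reward_stat parameter accepts any univariate statistic. For instance, passing stats.Mean() reports the average reward over the samples that were actually used rather than their sum; the snippet below reuses the clicks logs from the example above.

from river import stats

avg_reward, n_samples_used = bandit.evaluate_offline(
    policy=bandit.EpsilonGreedy(0.1, seed=42),
    history=clicks,
    reward_stat=stats.Mean(),   # track the mean reward instead of the sum
)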