evaluate_offline

Evaluate a policy on historical logs using replay.

This is a high-level utility function for evaluating a policy with the replay methodology, an off-policy evaluation method. It does not require an environment; it is driven entirely by historical data.

At each step, an arm is pulled from the provided policy. If it matches the arm that was pulled in the historical data, the logged reward is used to update the policy and is counted towards the score. If it differs, the sample is ignored. This is the off-policy aspect of the evaluation.
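
The loop below is a minimal sketch of this replay idea. The replay function and the pull/update method names on the policy are hypothetical stand-ins used purely for illustration; evaluate_offline performs the same bookkeeping internally.

# A minimal sketch of the replay loop described above. The `replay` function
# and the `pull`/`update` method names are hypothetical stand-ins; this is
# not River's internal implementation.
def replay(policy, arms, logged_pulls):
    total_reward = 0.0
    n_samples_used = 0
    for logged_arm, reward in logged_pulls:
        picked = policy.pull(arms)
        if picked == logged_arm:
            # The policy agrees with the logs, so the logged reward applies.
            policy.update(logged_arm, reward)
            total_reward += reward
            n_samples_used += 1
        # Otherwise the sample is skipped: the logs cannot tell us what reward
        # the policy's own choice would have earned.
    return total_reward, n_samples_used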

Parameters

  • policy

    Type: bandit.base.Policy

    The policy to evaluate.

  • history

    Type: History

    The history of the bandit problem. This is a generator that yields tuples of the form (context, arm, probability, reward). The probability is optional; it is the probability the logging policy had of picking the arm. If provided, it is used to unbias the final score via inverse propensity scoring, as sketched just after this parameter list.

  • reward_stat

    Type: stats.base.Univariate

    Default: None

    The reward statistic to use. Defaults to stats.Sum.
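
As a rough illustration of the inverse propensity scoring mentioned for the history parameter: when a probability is logged, a matched reward is reweighted by its inverse. The helper below is a hypothetical sketch of that idea, not River's internal code.

# Rough sketch of inverse propensity scoring (IPS); illustrative only.
def ips_reward(reward, probability=None):
    if probability is None:
        # No propensity was logged: the raw reward is used.
        return reward
    # Rewards observed under unlikely pulls are scaled up, so the final score
    # is not biased towards the arms the logging policy favoured.
    return reward / probability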

Examples

import random
from river import bandit

rng = random.Random(42)

arms = ['A', 'B', 'C']
# Simulate 1,000 logged pulls
clicks = [
    (
        arms,
        # arm that was pulled in the logs
        rng.choice(arms),
        # probability attached to that pull
        (p := rng.random()),
        # reward: whether a click was observed
        p > 0.9
    )
    for _ in range(1000)
]

total_reward, n_samples_used = bandit.evaluate_offline(
    policy=bandit.EpsilonGreedy(0.1, seed=42),
    history=clicks,
)

total_reward
Sum: 33.626211

n_samples_used
323
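
The reward_stat parameter controls which statistic is accumulated. As a sketch, reusing the clicks history from above, the mean reward per used sample could be tracked with stats.Mean instead of the default stats.Sum (output not shown):

from river import stats

mean_reward, n_samples_used = bandit.evaluate_offline(
    policy=bandit.EpsilonGreedy(0.1, seed=42),
    history=clicks,
    reward_stat=stats.Mean(),
)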