evaluate_offline

Evaluate a policy on historical logs using replay.

This is a high-level utility function for evaluating a policy with the replay methodology. Replay is an off-policy evaluation method: it does not require an environment, and is instead entirely data-driven.

At each step, an arm is pulled from the provided policy. If it matches the arm that was pulled in the historical data, the logged reward is used to update the policy and is added to the running total; otherwise, the sample is discarded. This is what makes the evaluation off-policy.
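
A minimal sketch of that replay loop is given below, for illustration only. It assumes a non-contextual policy exposing pull(arm_ids) and update(arm_id, reward) methods and ignores the context; the actual implementation in river may differ.

from river import stats

def replay_sketch(policy, history):
    # Running total of the rewards collected on matching samples.
    total_reward = stats.Sum()
    n_samples_used = 0
    for arms, context, logged_arm, logged_reward in history:
        arm = policy.pull(arms)  # which arm would the policy have pulled?
        if arm == logged_arm:    # only samples where the two choices agree are usable
            policy.update(arm, logged_reward)
            total_reward.update(logged_reward)
            n_samples_used += 1
        # otherwise the logged reward is ignored
    return total_reward, n_samples_used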

Parameters

policy
    The bandit policy to evaluate.
history
    The logged data to replay: an iterable of (arms, context, arm, reward) tuples, as in the first example below, or a dataset that inherits from river.bandit.BanditDataset, as in the second example.

Examples

import random
from river import bandit

rng = random.Random(42)
arms = ['A', 'B', 'C']
clicks = [
    (
        # available arms
        arms,
        # no context
        None,
        # random arm
        rng.choice(arms),
        # reward
        rng.random() > 0.5
    )
    for _ in range(1000)
]

total_reward, n_samples_used = bandit.evaluate_offline(
    policy=bandit.EpsilonGreedy(0.1, seed=42),
    history=clicks,
)

total_reward
Sum: 172.

n_samples_used
321
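
The returned total_reward is a running statistic rather than a plain float (a Sum in this case, as the output above shows). Assuming it follows the usual river stats interface, its raw value can be read with .get():

total_reward.get()
172.0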

This also works out of the box with datasets that inherit from river.bandit.BanditDataset.

news = bandit.datasets.NewsArticles()
total_reward, n_samples_used = bandit.evaluate_offline(
    policy=bandit.RandomPolicy(seed=42),
    history=news,
)

total_reward, n_samples_used
(Sum: 105., 1027)

As expected, the policy's chosen arm matches the logged arm in roughly 10% of cases. Indeed, there are 10 arms and 10,000 samples, so the expected number of matches is 1,000, which is close to the 1,027 samples actually used.
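
As a quick sanity check of that back-of-the-envelope figure (the arm and sample counts below are simply the numbers quoted above, not read from the dataset):

n_arms = 10
n_samples = 10_000
n_samples / n_arms  # expected number of matches under a uniform random policy
1000.0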