evaluate_offline

Evaluate a policy on historical logs using replay.

This is a high-level utility function for evaluating a policy with the replay methodology. Replay is an off-policy evaluation method: it does not require an environment, and is instead entirely data-driven.

At each step, an arm is pulled from the provided policy. If it matches the arm that was pulled in the historical data, the logged reward is used to update the policy and is added to the running total; otherwise, the sample is discarded. This is what makes the evaluation off-policy.
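
A minimal sketch of that replay loop is given below, for illustration only. It assumes a non-contextual policy exposing pull(arm_ids) and update(arm_id, reward) methods and ignores the context; the actual implementation in river may differ.

from river import stats

def replay_sketch(policy, history):
    # Running total of the rewards collected on matching samples.
    total_reward = stats.Sum()
    n_samples_used = 0
    for arms, context, logged_arm, logged_reward in history:
        arm = policy.pull(arms)  # which arm would the policy have pulled?
        if arm == logged_arm:    # only samples where the two choices agree are usable
            policy.update(arm, logged_reward)
            total_reward.update(logged_reward)
            n_samples_used += 1
        # otherwise the logged reward is ignored
    return total_reward, n_samples_used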

Parameters

policy
    The bandit policy to evaluate.
history
    The logged data to replay: an iterable of (arms, context, arm, reward) tuples, as in the first example below, or a dataset that inherits from river.bandit.BanditDataset, as in the second example.

Examples

import random
from river import bandit

rng = random.Random(42)
arms = ['A', 'B', 'C']
clicks = [
    (
        # available arms
        arms,
        # no context
        None,
        # random arm
        rng.choice(arms),
        # reward
        rng.random() > 0.5
    )
    for _ in range(1000)
]

total_reward, n_samples_used = bandit.evaluate_offline(
    policy=bandit.EpsilonGreedy(0.1, seed=42),
    history=clicks,
)

total_reward
Sum: 172.

n_samples_used
321
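
The returned total_reward is a running statistic rather than a plain float (a Sum in this case, as the output above shows). Assuming it follows the usual river stats interface, its raw value can be read with .get():

total_reward.get()
172.0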

This also works out of the box with datasets that inherit from river.bandit.BanditDataset.

news = bandit.datasets.NewsArticles()
total_reward, n_samples_used = bandit.evaluate_offline(
    policy=bandit.RandomPolicy(seed=42),
    history=news,
)

total_reward, n_samples_used
(Sum: 105., 1027)

As expected, the policy's chosen arm matches the logged arm in roughly 10% of cases. Indeed, there are 10 arms and 10,000 samples, so the expected number of matches is 1,000, which is close to the 1,027 samples actually used.
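
As a quick sanity check of that back-of-the-envelope figure (the arm and sample counts below are simply the numbers quoted above, not read from the dataset):

n_arms = 10
n_samples = 10_000
n_samples / n_arms  # expected number of matches under a uniform random policy
1000.0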