evaluate¶
Benchmark a list of policies on a given Gym environment.
This is a high-level utility function that runs each policy on the environment and yields one record per step of each episode. For example, the output can be used to populate a pandas.DataFrame.
Parameters¶
- policies (List[river.bandit.base.Policy])
  A list of policies to evaluate. Each policy will be reset before each episode.
- env ('gym.Env')
  The Gym environment to use. One copy will be made for each policy at the beginning of each episode.
- pull_func (Callable[[river.bandit.base.Policy, 'gym.Env'], Union[int, str]])
  A function that takes a policy and an environment as arguments and returns the arm that was pulled. This function is called at each time step for each policy. It is required because there is no standard way to pull an arm in Gym environments.
- reward_stat (river.stats.base.Univariate) – defaults to None
  A univariate statistic to keep track of the rewards. This statistic will be reset before each episode. Note that this is not the same as the reward object used by the policies; it is just a statistic to keep track of each policy's performance. If None, stats.Sum is used. See the sketch after this list for passing a custom statistic.
- n_episodes (int) – defaults to 20
  The number of episodes to run.
- seed (int) – defaults to None
  Random number generator seed for reproducibility. A random number generator will be used to seed the environment differently before each episode.
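As a minimal sketch of the reward_stat parameter, reusing the same environment and pull_func as the examples below, you could track the running mean reward instead of the default cumulative sum:

>>> import gym
>>> from river import bandit, stats

>>> def pull_func(policy, env):
...     return next(policy.pull(range(env.action_space.n)))

>>> trace = bandit.evaluate(
...     policies=[bandit.EpsilonGreedy(epsilon=0.1, seed=42)],
...     env=gym.make(
...         'river_bandits/CandyCaneContest-v0',
...         max_episode_steps=100
...     ),
...     pull_func=pull_func,
...     reward_stat=stats.Mean(),  # running mean instead of the default stats.Sum
...     n_episodes=2,
...     seed=42
... )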
Examples¶
>>> import gym
>>> from river import bandit
>>> def pull_func(policy, env):
... return next(policy.pull(range(env.action_space.n)))
>>> trace = bandit.evaluate(
... policies=[
... bandit.UCB(delta=1),
... bandit.EpsilonGreedy(epsilon=0.1, seed=42),
... ],
... env=gym.make(
... 'river_bandits/CandyCaneContest-v0',
... max_episode_steps=100
... ),
... pull_func=pull_func,
... n_episodes=5,
... seed=42
... )
>>> for step in trace:
... print(step)
... break
{'episode': 0, 'step': 0, 'policy_idx': 0, 'action': 0, 'reward': 0.0, 'reward_stat': 0.0}
The return type of this function is a generator. Each step of the generator is a dictionary. You can pass the generator to a pandas.DataFrame to get a nice representation of the results.
>>> import pandas as pd
>>> trace = bandit.evaluate(
... policies=[
... bandit.UCB(delta=1),
... bandit.EpsilonGreedy(epsilon=0.1, seed=42),
... ],
... env=gym.make(
... 'river_bandits/CandyCaneContest-v0',
... max_episode_steps=100
... ),
... pull_func=pull_func,
... n_episodes=5,
... seed=42
... )
>>> trace_df = pd.DataFrame(trace)
>>> trace_df.sample(5, random_state=42)
episode step policy_idx action reward reward_stat
521 2 60 1 25 0.0 36.0
737 3 68 1 40 1.0 20.0
740 3 70 0 70 1.0 33.0
660 3 30 0 30 1.0 13.0
411 2 5 1 35 1.0 5.0
The length of the dataframe is the number of policies times the number of episodes times the maximum number of steps per episode.
>>> len(trace_df)
1000
>>> (
... trace_df.policy_idx.nunique() *
... trace_df.episode.nunique() *
... trace_df.step.nunique()
... )
1000
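From here you can aggregate the trace however you like. As a small sketch (output omitted, since the exact numbers depend on the run), summing the rewards per policy gives a quick comparison:

>>> totals = trace_df.groupby('policy_idx')['reward'].sum()
>>> best_policy_idx = totals.idxmax()  # index of the policy with the highest cumulative reward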