evaluate
Benchmark a list of policies on a given Gym environment.
This is a high-level utility function for benchmarking a list of policies on a given Gym environment. For example, it can be used to populate a pandas.DataFrame with the contents of each step of each episode.
Parameters
-
policies
Type → list[bandit.base.Policy]
A list of policies to evaluate. Each policy is reset before each episode.
-
env
Type → gym.Env
The Gym environment to use. One copy will be made for each policy at the beginning of each episode.
-
reward_stat
Type → stats.base.Univariate | None
Default →
None
A univariate statistic to keep track of the rewards. This statistic will be reset before each episode. Note that this is not the same as the reward object used by the policies. It's just a statistic to keep track of each policy's performance. If None, stats.Sum is used.
-
n_episodes
Type → int
Default →
20
The number of episodes to run.
-
seed
Type → int | None
Default →
None
Random number generator seed for reproducibility. A random number generator is used to seed the environment differently before each episode.
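The parameter descriptions above can be read as an evaluation loop. The following is a minimal sketch of that reading, using only the standard library. It is not river's implementation: `toy_evaluate`, `RunningSum`, `RandomPolicy`, and `CoinFlipEnv` are hypothetical names introduced for illustration, and the fresh copy of the statistic at the start of each episode stands in for resetting it.

```python
import copy
import random


class RunningSum:
    """Hypothetical stand-in for a univariate statistic such as stats.Sum."""

    def __init__(self):
        self.total = 0.0

    def update(self, x):
        self.total += x

    def get(self):
        return self.total


class RandomPolicy:
    """Hypothetical policy that pulls a uniformly random arm."""

    def __init__(self, n_arms, seed=None):
        self.n_arms = n_arms
        self.seed = seed
        self.reset()

    def reset(self):
        # Restart the policy's own random stream.
        self.rng = random.Random(self.seed)

    def pull(self):
        return self.rng.randrange(self.n_arms)


class CoinFlipEnv:
    """Hypothetical environment: arm i pays 1.0 with probability probs[i]."""

    def __init__(self, probs, max_steps):
        self.probs = probs
        self.max_steps = max_steps
        self.rng = random.Random()

    def reset(self, seed=None):
        self.rng = random.Random(seed)

    def step(self, arm):
        return 1.0 if self.rng.random() < self.probs[arm] else 0.0


def toy_evaluate(policies, env, reward_stat=None, n_episodes=20, seed=None):
    """Yield one record per (episode, step, policy)."""
    reward_stat = reward_stat if reward_stat is not None else RunningSum()
    seeder = random.Random(seed)  # draws a different environment seed per episode
    for episode in range(n_episodes):
        episode_seed = seeder.randrange(2**32)
        runs = []
        for policy in policies:
            policy.reset()                   # policy is reset before each episode
            policy_env = copy.deepcopy(env)  # one environment copy per policy
            policy_env.reset(seed=episode_seed)
            stat = copy.deepcopy(reward_stat)  # fresh statistic for this episode
            runs.append((policy, policy_env, stat))
        for step in range(env.max_steps):
            for policy_idx, (policy, policy_env, stat) in enumerate(runs):
                arm = policy.pull()
                reward = policy_env.step(arm)
                stat.update(reward)
                yield {
                    "episode": episode,
                    "step": step,
                    "policy_idx": policy_idx,
                    "arm": arm,
                    "reward": reward,
                    "reward_stat": stat.get(),
                }


policies = [RandomPolicy(3, seed=1), RandomPolicy(3, seed=2)]
env = CoinFlipEnv([0.2, 0.5, 0.8], max_steps=10)
trace = list(toy_evaluate(policies, env, n_episodes=3, seed=42))
print(len(trace))  # 2 policies * 3 episodes * 10 steps = 60
```

Because the per-episode environment seeds are drawn from a generator that is itself seeded, two calls with the same `seed` produce the same records in this sketch, while different episodes still see different environment randomness.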
Examples
import gymnasium as gym
from river import bandit
trace = bandit.evaluate(
    policies=[
        bandit.UCB(delta=1, seed=42),
        bandit.EpsilonGreedy(epsilon=0.1, seed=42),
    ],
    env=gym.make(
        'river_bandits/CandyCaneContest-v0',
        max_episode_steps=100
    ),
    n_episodes=5,
    seed=42
)
for step in trace:
    print(step)
    break
{'episode': 0, 'step': 0, 'policy_idx': 0, 'arm': 81, 'reward': 0.0, 'reward_stat': 0.0}
This function returns a generator. Each item it yields is a dictionary describing one step of one episode for one policy.
You can pass the generator to a pandas.DataFrame to get a tabular view of the results.
import pandas as pd
trace = bandit.evaluate(
    policies=[
        bandit.UCB(delta=1, seed=42),
        bandit.EpsilonGreedy(epsilon=0.1, seed=42),
    ],
    env=gym.make(
        'river_bandits/CandyCaneContest-v0',
        max_episode_steps=100
    ),
    n_episodes=5,
    seed=42
)
trace_df = pd.DataFrame(trace)
trace_df.sample(5, random_state=42)
     episode  step  policy_idx  arm  reward  reward_stat
521        2    60           1   25     0.0         36.0
737        3    68           1   40     1.0         20.0
740        3    70           0   58     0.0         36.0
660        3    30           0   31     1.0         16.0
411        2     5           1   35     1.0          5.0
The length of the dataframe is the number of policies times the number of episodes times the maximum number of steps per episode, here 2 × 5 × 100 = 1000.
len(trace_df)
1000
(
    trace_df.policy_idx.nunique() *
    trace_df.episode.nunique() *
    trace_df.step.nunique()
)
1000
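Since reward_stat is reset before each episode and defaults to a running sum, the record for the last step of an episode should carry that policy's total reward for the episode. A minimal sketch of extracting those final values from a list of records without pandas follows. The helper `final_reward_stats` is a hypothetical name, and the records are made up to match the dictionary layout shown above; they are not output from bandit.evaluate.

```python
def final_reward_stats(records):
    """Map (episode, policy_idx) to the reward_stat of that episode's last step."""
    last = {}
    for rec in records:
        key = (rec["episode"], rec["policy_idx"])
        # Keep the record with the highest step seen so far for this key.
        if key not in last or rec["step"] >= last[key]["step"]:
            last[key] = rec
    return {key: rec["reward_stat"] for key, rec in last.items()}


# Hypothetical records for illustration only.
records = [
    {"episode": 0, "step": 0, "policy_idx": 0, "arm": 3, "reward": 1.0, "reward_stat": 1.0},
    {"episode": 0, "step": 0, "policy_idx": 1, "arm": 7, "reward": 0.0, "reward_stat": 0.0},
    {"episode": 0, "step": 1, "policy_idx": 0, "arm": 3, "reward": 1.0, "reward_stat": 2.0},
    {"episode": 0, "step": 1, "policy_idx": 1, "arm": 2, "reward": 1.0, "reward_stat": 1.0},
]

print(final_reward_stats(records))  # {(0, 0): 2.0, (0, 1): 1.0}
```

The same helper accepts the generator returned by the evaluation call directly, since it only iterates over the records once.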