Epsilon-greedy bandit algorithm for regression.

This bandit selects the best arm (the one with the highest average reward) with probability \(1 - \epsilon\) and draws a random arm with probability \(\epsilon\). It is also called the Follow-The-Leader (FTL) algorithm.
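The selection rule described above can be sketched in a few lines of plain Python. This is only an illustration of the epsilon-greedy rule, not the library's implementation; `select_arm` and its arguments are hypothetical names introduced here.

```python
import random


def select_arm(average_rewards, epsilon, rng):
    """Return an arm index chosen with the epsilon-greedy rule (illustrative helper)."""
    if rng.random() < epsilon:
        # Explore: draw any arm uniformly at random.
        return rng.randrange(len(average_rewards))
    # Exploit: pick the arm with the highest average reward so far.
    return max(range(len(average_rewards)), key=average_rewards.__getitem__)


rng = random.Random(1)
rewards = [0.0, 0.0, 0.56, 0.01]  # made-up average rewards, one per arm
pulls = [select_arm(rewards, epsilon=0.1, rng=rng) for _ in range(1000)]
greedy_share = pulls.count(2) / len(pulls)  # share of pulls given to the best arm
```

Because the random draw can also land on the best arm, the best of \(K\) arms is pulled with probability \((1 - \epsilon) + \epsilon / K\) in this sketch.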


  • models (List[base.Estimator])

    The models to compare.

  • metric (river.metrics.base.RegressionMetric) – defaults to None

    Metric used to compare the models.

  • epsilon (float) – defaults to 0.1

    Exploration parameter: the probability of drawing a random arm instead of the current best one.

  • epsilon_decay (float) – defaults to None

    Exponential decay factor applied to epsilon.

  • explore_each_arm (int) – defaults to 3

    The number of times each arm should be explored first.

  • start_after (int) – defaults to 20

    The number of iterations after which the bandit mechanism should begin.

  • seed (int) – defaults to None

    The seed for the random number generator (the algorithm is not deterministic).
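The exact schedule behind epsilon_decay is not spelled out in this page. As a rough sketch, assuming the decay multiplies epsilon by \(e^{-\text{decay} \cdot t}\) at iteration \(t\) (an assumption, as is the helper name `decayed_epsilon`), the effective exploration rate would evolve like this:

```python
import math


def decayed_epsilon(epsilon, epsilon_decay, n_iter):
    """Assumed schedule: epsilon shrinks exponentially with the iteration count."""
    if epsilon_decay is None:
        # No decay configured: the exploration rate stays constant.
        return epsilon
    return epsilon * math.exp(-epsilon_decay * n_iter)


# Exploration rate at iterations 0, 100 and 1000 for epsilon=0.1, decay=0.01.
schedule = [decayed_epsilon(0.1, 0.01, t) for t in (0, 100, 1000)]
```

Under this assumption, a larger decay factor makes the bandit commit to the current best arm sooner.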


  • best_model

    Returns the best model, defined as the one that maximises average reward.

  • percentage_pulled

    Returns the share of pulls (in %) that each arm has received.
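As an illustration of what such a share means, the sketch below derives it from raw pull counts. The function and the counts are hypothetical; the values are fractions in [0, 1], which is what the `:.2%` format specifier expects.

```python
def percentage_pulled(pull_counts):
    """Turn raw pull counts into per-arm fractions (illustrative helper)."""
    total = sum(pull_counts)
    if total == 0:
        # Nothing has been pulled yet, so every share is zero.
        return [0.0] * len(pull_counts)
    return [count / total for count in pull_counts]


shares = percentage_pulled([3, 3, 45, 49])  # made-up counts for four arms
formatted = [f"{share:.2%}" for share in shares]
```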


Let's use EpsilonGreedyRegressor to select the best learning rate for a linear regression model. First, we define the grid of models:

>>> from river import compose
>>> from river import linear_model
>>> from river import preprocessing
>>> from river import optim

>>> models = [
...     compose.Pipeline(
...         preprocessing.StandardScaler(),
...         linear_model.LinearRegression(optimizer=optim.SGD(lr=lr))
...     )
...     for lr in [1e-4, 1e-3, 1e-2, 1e-1]
... ]

We use the TrumpApproval dataset:

>>> from river import datasets
>>> dataset = datasets.TrumpApproval()

The chosen bandit is epsilon-greedy:

>>> from river.expert import EpsilonGreedyRegressor
>>> bandit = EpsilonGreedyRegressor(models=models, seed=1)

The models in the bandit can then be trained in an online fashion.

>>> for x, y in dataset:
...     bandit = bandit.learn_one(x=x, y=y)

We can inspect the number of times (in percentage) each arm has been pulled.

>>> for model, pct in zip(bandit.models, bandit.percentage_pulled):
...     lr = model["LinearRegression"].optimizer.learning_rate
...     print(f"{lr:.1e}  {pct:.2%}")
1.0e-04  3.47%
1.0e-03  2.85%
1.0e-02  44.75%
1.0e-01  48.93%

The average reward of each model is also available:

>>> for model, avg in zip(bandit.models, bandit.average_reward):
...     lr = model["LinearRegression"].optimizer.learning_rate
...     print(f"{lr:.1e}  {avg:.2f}")
1.0e-04  0.00
1.0e-03  0.00
1.0e-02  0.56
1.0e-01  0.01
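How a metric value is turned into a reward is not described here. The sketch below only shows the running-average bookkeeping that an "average reward" implies, using a hypothetical `RunningMean` class and made-up reward values.

```python
class RunningMean:
    """Incrementally updated mean of the rewards seen so far (illustrative)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, reward):
        # Incremental mean: no need to store the past rewards.
        self.n += 1
        self.mean += (reward - self.mean) / self.n
        return self


arm = RunningMean()
for reward in [1.0, 0.0, 0.5]:  # made-up rewards
    arm.update(reward)
```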

We can also select the best model (the one with the highest average reward).

>>> best_model = bandit.best_model

The learning rate chosen by the bandit is:

>>> best_model["LinearRegression"].optimizer.learning_rate
0.01



Return a fresh estimator with the same parameters.

The clone has the same parameters but has not been updated with any data. This works by looking at the parameters from the class signature. Each parameter is either:

  • recursively cloned if it's a River class;
  • deep-copied via copy.deepcopy if not.

If the calling object is stochastic (i.e. it accepts a seed parameter) and has not been seeded, then the clone will not be idempotent. Indeed, this method's purpose is simply to return a new instance with the same input parameters.
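The cloning rule can be illustrated with a small standalone sketch. It is not the library's code: `Estimator`, `Mean`, `Scaled` and the free function `clone` are hypothetical stand-ins, and the sketch assumes every constructor parameter is stored under an attribute of the same name.

```python
import copy
import inspect


class Estimator:
    """Hypothetical base class standing in for a River estimator."""


def clone(obj):
    """Rebuild obj from its constructor parameters, dropping any learned state."""
    params = {}
    for name in inspect.signature(type(obj).__init__).parameters:
        if name == "self":
            continue
        value = getattr(obj, name)
        # Recurse into nested estimators, deep-copy everything else.
        params[name] = clone(value) if isinstance(value, Estimator) else copy.deepcopy(value)
    return type(obj)(**params)


class Mean(Estimator):
    def __init__(self, start=0.0):
        self.start = start
        self.total = start  # learned state, not a constructor parameter
        self.n = 0

    def learn_one(self, y):
        self.total += y
        self.n += 1
        return self


class Scaled(Estimator):
    def __init__(self, inner, factor=2.0):
        self.inner = inner
        self.factor = factor


fitted = Mean(start=1.0).learn_one(5.0)
fresh = clone(fitted)           # same parameters, no learned state
nested = clone(Scaled(fitted))  # the inner estimator is cloned recursively
```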


Updates the chosen model and the arm internals (the actual implementation is in Bandit._learn_one).


  • x
  • y

Return the prediction of the best model (defined as the one that maximises average reward).


  • x