

Upper Confidence Bound bandit for regression.

The class offers two implementations of UCB:

  • UCB1 [1], used when the parameter delta is None
  • UCB(delta) [2], used when the parameter delta is in (0, 1)

For this bandit, rewards are assumed to be 1-subgaussian (see Lattimore and Szepesvári, chapter 6, p. 91), hence the use of StandardScaler and MaxAbsScaler as the reward_scaler.
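As a rough illustration of the two variants, here is a minimal, self-contained sketch (not River's implementation) of the index each one assigns to an arm: the running average reward plus an exploration bonus, where UCB1 uses the total number of pulls t and UCB(delta) uses the confidence level delta. The arm counts below are hypothetical.

```python
import math

def ucb_index(avg_reward, n_pulls, t, delta=None):
    # Sketch of the two indices described above:
    # UCB1 when delta is None, UCB(delta) when delta is in (0, 1).
    if n_pulls == 0:
        return math.inf  # unexplored arms are always pulled first
    if delta is None:                         # UCB1
        bonus = math.sqrt(2 * math.log(t) / n_pulls)
    else:                                     # UCB(delta)
        bonus = math.sqrt(2 * math.log(1 / delta) / n_pulls)
    return avg_reward + bonus

# The arm with the highest index is pulled next.
arms = [(0.74, 920), (0.05, 28)]  # hypothetical (average reward, pulls) pairs
t = sum(n for _, n in arms)
best = max(range(len(arms)), key=lambda i: ucb_index(*arms[i], t))
```

A lower delta inflates the bonus, which is why the parameter description below says a lower value means more exploration.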


  • models (List[base.Estimator])

    The models to compare.

  • metric (river.metrics.base.RegressionMetric) – defaults to None

    The metric used to compare the models.

  • delta (float) – defaults to None

    For UCB(delta) implementation. Lower value means more exploration.

  • explore_each_arm (int) – defaults to 1

    The number of times each arm should be explored first.

  • start_after (int) – defaults to 20

    The number of iterations after which the bandit mechanism should begin.

  • seed (int) – defaults to None

    The seed for the random number generator (the algorithm is not deterministic).


  • best_model

    Returns the best model, defined as the one that maximises the average reward.

  • percentage_pulled

    Returns the percentage of times each arm has been pulled.
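The two attributes above can be sketched in plain Python from the bandit's per-arm statistics; the pull counts and running averages below are hypothetical, for illustration only.

```python
# Hypothetical per-arm state after a run of the bandit.
pulls = [23, 23, 867, 27]          # times each arm was pulled
avg_reward = [0.00, 0.00, 0.74, 0.05]  # running average reward per arm

# percentage_pulled: each arm's share of the total pulls.
total = sum(pulls)
percentage_pulled = [n / total for n in pulls]

# best_model: the arm whose average reward is highest.
best_arm = max(range(len(avg_reward)), key=avg_reward.__getitem__)
```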


Let's use UCBRegressor to select the best learning rate for a linear regression model. First, we define the grid of models:

>>> from river import compose
>>> from river import linear_model
>>> from river import preprocessing
>>> from river import optim

>>> models = [
...     compose.Pipeline(
...         preprocessing.StandardScaler(),
...         linear_model.LinearRegression(optimizer=optim.SGD(lr=lr))
...     )
...     for lr in [1e-4, 1e-3, 1e-2, 1e-1]
... ]

We decide to use the TrumpApproval dataset:

>>> from river import datasets
>>> dataset = datasets.TrumpApproval()

We use the UCB bandit:

>>> from river.expert import UCBRegressor
>>> bandit = UCBRegressor(models=models, seed=1)

The models in the bandit can be trained in an online fashion.

>>> for x, y in dataset:
...     bandit = bandit.learn_one(x=x, y=y)

We can inspect the number of times (in percentage) each arm has been pulled.

>>> for model, pct in zip(bandit.models, bandit.percentage_pulled):
...     lr = model["LinearRegression"].optimizer.learning_rate
...     print(f"{lr:.1e}  {pct:.2%}")
1.0e-04  2.45%
1.0e-03  2.45%
1.0e-02  92.25%
1.0e-01  2.85%

The average reward of each model is also available:

>>> for model, avg in zip(bandit.models, bandit.average_reward):
...     lr = model["LinearRegression"].optimizer.learning_rate
...     print(f"{lr:.1e}  {avg:.2f}")
1.0e-04  0.00
1.0e-03  0.00
1.0e-02  0.74
1.0e-01  0.05

We can also select the best model (the one with the highest average reward).

>>> best_model = bandit.best_model

The learning rate chosen by the bandit is:

>>> best_model["LinearRegression"].optimizer.learning_rate
0.01



Return a fresh estimator with the same parameters.

The clone has the same parameters but has not been updated with any data. This works by looking at the parameters of the class signature. Each parameter is either recursively cloned if it is a River class, or deep-copied via copy.deepcopy otherwise. If the calling object is stochastic (i.e. it accepts a seed parameter) and has not been seeded, then the clone will not be idempotent. Indeed, this method's purpose is simply to return a new instance with the same input parameters.
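The general idea can be sketched in a few lines (this is a simplification, not River's actual clone logic): read the parameter names off the __init__ signature, deep-copy the corresponding attributes, and build a new instance, so learned state that is not an __init__ parameter is left behind. The SGD class below is a made-up stand-in.

```python
import copy
import inspect

def clone(estimator):
    # Rebuild an estimator from the parameters named in its __init__
    # signature; anything learned outside those parameters is dropped.
    params = {
        name: copy.deepcopy(getattr(estimator, name))
        for name in inspect.signature(type(estimator).__init__).parameters
        if name != "self"
    }
    return type(estimator)(**params)

class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr
        self.n_updates = 0  # mutable state, not an __init__ parameter

model = SGD(lr=0.1)
model.n_updates = 42
fresh = clone(model)
print(fresh.lr, fresh.n_updates)  # 0.1 0 — same parameters, no learned state
```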


Updates the chosen model and the arm internals (the actual implementation is in Bandit._learn_one).


  • x
  • y

Return the prediction of the best model (defined as the one that maximises the average reward).


  • x
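A minimal sketch of the delegation this method performs, with made-up stand-in models and hypothetical reward averages: pick the arm with the highest average reward, then return that model's prediction for x.

```python
# Hypothetical running averages and stand-in "models" (plain callables).
avg_reward = [0.00, 0.00, 0.74, 0.05]
models = [lambda x, b=b: b * x["feature"] for b in (1, 2, 3, 4)]

best = max(range(len(models)), key=avg_reward.__getitem__)
y_pred = models[best]({"feature": 2.0})  # third model (b=3) → 6.0
```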