Regression

Regression is about predicting a numeric output for a given sample. A labeled regression sample consists of a set of features and a numeric target. The target is usually continuous, but it may also be discrete. We'll use the Trump approval rating dataset as an example.

from river import datasets

dataset = datasets.TrumpApproval()
dataset


Donald Trump approval ratings.

This dataset was obtained by reshaping the data used by FiveThirtyEight for analyzing Donald
Trump's approval ratings. It contains 5 features, which are approval ratings collected by
5 polling agencies. The target is the approval rating from FiveThirtyEight's model. The goal of
this task is to see if we can reproduce FiveThirtyEight's model.

    Name  TrumpApproval                                                     
    Task  Regression                                                        
 Samples  1,001                                                             
Features  6                                                                 
  Sparse  False                                                             
    Path  /home/runner/work/river/river/river/datasets/trump_approval.csv.gz

This is a streaming dataset: you can loop over it, and each iteration yields a sample's features x and its target y.

for x, y in dataset:
    pass
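
Any iterable that yields `(x, y)` pairs, with `x` a dict of features and `y` a number, can be consumed by a loop of this shape. The sketch below is a minimal illustration, assuming a hypothetical helper `iter_samples` and made-up rows; neither the names nor the values come from the dataset or from River.

```python
def iter_samples(rows, target):
    """Yield (features, target) pairs one at a time from a list of dict rows."""
    for row in rows:
        # Everything except the target column is treated as a feature.
        x = {name: value for name, value in row.items() if name != target}
        yield x, row[target]


# Hypothetical rows with made-up values, for illustration only.
rows = [
    {"poll_a": 41.0, "poll_b": 43.0, "approval": 42.0},
    {"poll_a": 40.0, "poll_b": 44.0, "approval": 42.5},
]

samples = list(iter_samples(rows, target="approval"))
for x_demo, y_demo in samples:
    print(x_demo, y_demo)
```

Using a generator keeps only one sample in memory at a time, which is the property that matters when the data arrives as a stream.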

Let's take a look at the first sample.

x, y = next(iter(dataset))
x


{
    'ordinal_date': 736389,
    'gallup': 43.843213,
    'ipsos': 46.19925042857143,
    'morning_consult': 48.318749,
    'rasmussen': 44.104692,
    'you_gov': 43.636914000000004
}

A regression model's goal is to learn to predict a numeric target y from a set of features x. We'll attempt to do this with a nearest neighbors model.

from river import neighbors

model = neighbors.KNNRegressor()
model.predict_one(x)


0.0

The model hasn't been trained on any data, and therefore outputs a default value of 0.

The model can be trained on the sample, which will update the model's state.

model.learn_one(x, y)

If we try to make a prediction on the same sample, we can see that the output is different, because the model has learned something.

model.predict_one(x)


43.75505
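
To make the predict-then-learn behaviour concrete, here is a toy nearest neighbors regressor in plain Python. It is a sketch under stated assumptions, not River's implementation: the class `ToyKNNRegressor`, its Euclidean distance, its unbounded memory, and the sample values are all hypothetical choices made for illustration.

```python
import math


class ToyKNNRegressor:
    """Toy k-nearest-neighbors regressor over dict features (illustration only)."""

    def __init__(self, n_neighbors=3):
        self.n_neighbors = n_neighbors
        self.samples = []  # every (x, y) pair seen so far

    def learn_one(self, x, y):
        # Learning simply means remembering the sample.
        self.samples.append((dict(x), y))

    def predict_one(self, x):
        if not self.samples:
            return 0.0  # nothing seen yet, so fall back to a default

        def distance(other):
            # Euclidean distance over the features of the query sample.
            return math.sqrt(sum((x[k] - other.get(k, 0.0)) ** 2 for k in x))

        ranked = sorted(self.samples, key=lambda sample: distance(sample[0]))
        nearest = ranked[: self.n_neighbors]
        # Predict the average target of the closest stored samples.
        return sum(y for _, y in nearest) / len(nearest)


toy = ToyKNNRegressor()
x_demo = {"poll_a": 43.8, "poll_b": 46.2}  # made-up feature values
before = toy.predict_one(x_demo)  # no data yet, so the default is returned
toy.learn_one(x_demo, 43.75)      # made-up target value
after = toy.predict_one(x_demo)   # the only stored neighbor is this sample
print(before, after)
```

With a single stored sample, the average over the nearest neighbors is just that sample's target, which is why the prediction changes from the default to the value that was learned.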

Typically, an online model makes a prediction, and then learns once the ground truth reveals itself. The prediction and the ground truth can be compared to measure the model's correctness. If you have a dataset available, you can loop over it, make a prediction, update the model, and compare the model's output with the ground truth. This is called progressive validation.

from river import metrics

model = neighbors.KNNRegressor()

metric = metrics.MAE()

for x, y in dataset:
    y_pred = model.predict_one(x)
    model.learn_one(x, y)
    metric.update(y, y_pred)

metric


MAE: 0.310353

This is a common way to evaluate an online model. In fact, there is a dedicated evaluate.progressive_val_score function that does this for you.

from river import evaluate

model = neighbors.KNNRegressor()
metric = metrics.MAE()

evaluate.progressive_val_score(dataset, model, metric)


MAE: 0.310353
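
The loop behind progressive validation is short enough to sketch in plain Python. The version below is an illustration only, not the code of `evaluate.progressive_val_score`: the function `progressive_validate`, the `RunningMean` baseline model, the `RunningMAE` metric, and the three-sample stream are hypothetical stand-ins.

```python
class RunningMean:
    """Baseline model that predicts the mean of the targets seen so far."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict_one(self, x):
        return self.mean

    def learn_one(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n


class RunningMAE:
    """Mean absolute error, updated one observation at a time."""

    def __init__(self):
        self.n = 0
        self.value = 0.0

    def update(self, y_true, y_pred):
        self.n += 1
        self.value += (abs(y_true - y_pred) - self.value) / self.n


def progressive_validate(stream, model, metric):
    for x, y in stream:
        y_pred = model.predict_one(x)  # predict before the label is used
        metric.update(y, y_pred)       # score the prediction
        model.learn_one(x, y)          # only then learn from the label
    return metric


# Made-up stream: empty feature dicts and three targets.
toy_stream = [({}, 1.0), ({}, 3.0), ({}, 2.0)]
result = progressive_validate(toy_stream, RunningMean(), RunningMAE())
print(result.value)  # absolute errors are 1.0, 2.0 and 0.0, so the MAE is 1.0
```

Because every sample is scored before the model learns from it, each prediction is made on data the model has not yet seen, and no separate hold-out set is needed.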