Regression
Regression is about predicting a numeric output for a given sample. A labeled regression sample is made up of a bunch of features and a number, called the target. The target is usually continuous, but it may also be discrete. We'll use the Trump approval rating dataset as an example.
from river import datasets
dataset = datasets.TrumpApproval()
dataset
Donald Trump approval ratings.
This dataset was obtained by reshaping the data used by FiveThirtyEight for analyzing Donald
Trump's approval ratings. It contains 5 features, which are approval ratings collected by
5 polling agencies. The target is the approval rating from FiveThirtyEight's model. The goal of
this task is to see if we can reproduce FiveThirtyEight's model.
Name TrumpApproval
Task Regression
Samples 1,001
Features 6
Sparse False
Path /home/runner/work/river/river/river/datasets/trump_approval.csv.gz
This dataset is a streaming dataset which can be looped over.
for x, y in dataset:
    pass
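To illustrate what "streaming" means here, the following is a minimal sketch of a dataset that yields one `(x, y)` pair at a time. The `iter_samples` function, the column names, and the inline CSV rows are all made up for illustration. This is not how River's datasets are implemented.

```python
import csv
import io

# Made-up CSV content standing in for a file on disk (toy values).
RAW = """feature_a,feature_b,target
1.0,2.0,3.0
4.0,5.0,9.0
"""


def iter_samples(text, target="target"):
    """Yield (features, target) pairs one row at a time."""
    for row in csv.DictReader(io.StringIO(text)):
        y = float(row.pop(target))  # split the target off from the features
        x = {name: float(value) for name, value in row.items()}
        yield x, y


# Only one sample is held in memory at any moment.
for x, y in iter_samples(RAW):
    print(x, y)
```

Because the samples come out of a generator, the same loop works whether the source holds two rows or two million.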
Let's take a look at the first sample.
x, y = next(iter(dataset))
x
{
    'ordinal_date': 736389,
    'gallup': 43.843213,
    'ipsos': 46.19925042857143,
    'morning_consult': 48.318749,
    'rasmussen': 44.104692,
    'you_gov': 43.636914000000004
}
A regression model's goal is to learn to predict a numeric target y from a bunch of features x. We'll attempt to do this with a nearest neighbors model.
from river import neighbors
model = neighbors.KNNRegressor()
model.predict_one(x)
0.0
The model hasn't been trained on any data, and therefore outputs a default value of 0.
The model can be trained on the sample, which will update the model's state.
model.learn_one(x, y)
If we try to make a prediction on the same sample, we can see that the output is different, because the model has learned something.
model.predict_one(x)
43.75505
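To make the predict-then-learn behaviour more concrete, here is a minimal sketch of a nearest neighbors regressor with the same one-sample-at-a-time interface. `TinyKNNRegressor` is a hypothetical toy class, and its details (Euclidean distance, a bounded window of recent samples, an unweighted mean of the neighbors' targets) are assumptions made for the sketch rather than a description of River's `KNNRegressor`.

```python
import math
from collections import deque


class TinyKNNRegressor:
    """Toy nearest neighbors regressor (hypothetical, for illustration only)."""

    def __init__(self, n_neighbors=3, window_size=100):
        self.n_neighbors = n_neighbors
        # Keep only the most recent samples so memory stays bounded.
        self.window = deque(maxlen=window_size)

    def learn_one(self, x, y):
        # "Learning" here is just storing the labeled sample.
        self.window.append((x, y))

    def predict_one(self, x):
        if not self.window:
            return 0.0  # nothing seen yet, so fall back to a default

        def distance(other):
            # Euclidean distance; missing features count as 0.
            keys = x.keys() | other.keys()
            return math.sqrt(sum((x.get(k, 0.0) - other.get(k, 0.0)) ** 2 for k in keys))

        nearest = sorted(self.window, key=lambda s: distance(s[0]))[: self.n_neighbors]
        # Predict the unweighted mean of the nearest targets.
        return sum(y for _, y in nearest) / len(nearest)


model = TinyKNNRegressor()
x = {"a": 1.0, "b": 2.0}
print(model.predict_one(x))  # 0.0 before any training
model.learn_one(x, 5.0)
print(model.predict_one(x))  # 5.0, the only stored target
```

In this sketch, a model that has stored a single sample can only return that sample's target, which is why the prediction changes as soon as one sample has been learned.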
Typically, an online model makes a prediction, and then learns once the ground truth reveals itself. The prediction and the ground truth can be compared to measure the model's correctness. If you have a dataset available, you can loop over it, make a prediction, update the model, and compare the model's output with the ground truth. This is called progressive validation.
from river import metrics
model = neighbors.KNNRegressor()
metric = metrics.MAE()
for x, y in dataset:
    y_pred = model.predict_one(x)
    model.learn_one(x, y)
    metric.update(y, y_pred)
metric
MAE: 0.310353
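The metric in the loop above is updated one prediction at a time rather than computed over a stored list. As a rough sketch of how a mean absolute error can be maintained incrementally, consider the hypothetical `RunningMAE` class below. It is an illustration with toy numbers, not River's `metrics.MAE` implementation.

```python
class RunningMAE:
    """Mean absolute error maintained incrementally (hypothetical sketch)."""

    def __init__(self):
        self.n = 0        # number of predictions scored so far
        self.total = 0.0  # running sum of absolute errors

    def update(self, y_true, y_pred):
        self.n += 1
        self.total += abs(y_true - y_pred)

    def get(self):
        # Return 0.0 until at least one prediction has been scored.
        return self.total / self.n if self.n else 0.0


metric = RunningMAE()
for y_true, y_pred in [(3.0, 2.5), (1.0, 2.0), (4.0, 4.0)]:
    metric.update(y_true, y_pred)
print(metric.get())  # (0.5 + 1.0 + 0.0) / 3 = 0.5
```

Only a count and a running sum are kept, so the memory cost does not grow with the length of the stream.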
This is a common way to evaluate an online model. In fact, there is a dedicated evaluate.progressive_val_score function that does this for you.
from river import evaluate
model = neighbors.KNNRegressor()
metric = metrics.MAE()
evaluate.progressive_val_score(dataset, model, metric)
MAE: 0.310353
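Conceptually, a helper of this kind wraps the predict, learn, update loop shown earlier. The sketch below is a simplified stand-in under that assumption. The names `progressive_validate`, `RunningMean`, and `AbsoluteError` are hypothetical, the data is made up, and River's actual function offers more than this loop.

```python
def progressive_validate(stream, model, metric):
    """Score each prediction before the model learns from that sample."""
    for x, y in stream:
        y_pred = model.predict_one(x)  # predict without having seen y
        model.learn_one(x, y)          # then learn from the revealed label
        metric.update(y, y_pred)       # and score the earlier prediction
    return metric


class RunningMean:
    """Hypothetical baseline: predicts the mean of the targets seen so far."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict_one(self, x):
        return self.mean

    def learn_one(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n


class AbsoluteError:
    """Hypothetical running mean absolute error."""

    def __init__(self):
        self.n = 0
        self.total = 0.0

    def update(self, y_true, y_pred):
        self.n += 1
        self.total += abs(y_true - y_pred)

    def get(self):
        return self.total / self.n if self.n else 0.0


# Toy stream: the features are ignored by the baseline model.
stream = [({}, 1.0), ({}, 3.0), ({}, 2.0)]
result = progressive_validate(stream, RunningMean(), AbsoluteError())
print(result.get())  # errors are 1.0, 2.0, 0.0, so the mean is 1.0
```

Every sample is used first for testing and then for training, so no separate hold-out set is needed.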