Pipeline¶
A pipeline of estimators.
Pipelines allow you to chain different steps into a sequence. Typically, when doing supervised learning, a pipeline contains one ore more transformation steps, whilst it's is a regressor or a classifier. It is highly recommended to use pipelines with River. Indeed, in an online learning setting, it is very practical to have a model defined as a single object. Take a look at the user guide for further information and practical examples.
One special thing to take notice to is the way transformers are handled. It is usual to predict something for a sample and wait for the ground truth to arrive. In such a scenario, the features are seen before the ground truth arrives. Therefore, the unsupervised parts of the pipeline are updated when predict_one
and predict_proba_one
are called. Usually the unsupervised parts of the pipeline are all the steps that precede the final step, which is a supervised model. However, some transformers are supervised and are therefore also updated during calls to learn_one
.
Parameters¶
-
steps
Ideally, a list of (name, estimator) tuples. A name is automatically inferred if none is provided.
Examples¶
The recommended way to declare a pipeline is to use the |
operator. The latter allows you
to chain estimators in a very terse manner:
from river import linear_model
from river import preprocessing
scaler = preprocessing.StandardScaler()
log_reg = linear_model.LinearRegression()
model = scaler | log_reg
This results in a pipeline that stores each step inside a dictionary.
model
Pipeline (
StandardScaler (
with_std=True
),
LinearRegression (
optimizer=SGD (
lr=Constant (
learning_rate=0.01
)
)
loss=Squared ()
l2=0.
l1=0.
intercept_init=0.
intercept_lr=Constant (
learning_rate=0.01
)
clip_gradient=1e+12
initializer=Zeros ()
)
)
You can access parts of a pipeline in the same manner as a dictionary:
model['LinearRegression']
LinearRegression (
optimizer=SGD (
lr=Constant (
learning_rate=0.01
)
)
loss=Squared ()
l2=0.
l1=0.
intercept_init=0.
intercept_lr=Constant (
learning_rate=0.01
)
clip_gradient=1e+12
initializer=Zeros ()
)
Note that you can also declare a pipeline by using the compose.Pipeline
constructor
method, which is slightly more verbose:
from river import compose
model = compose.Pipeline(scaler, log_reg)
By using a compose.TransformerUnion
, you can define complex pipelines that apply
different steps to different parts of the data. For instance, we can extract word counts
from text data, and extract polynomial features from numeric data.
from river import feature_extraction as fx
tfidf = fx.TFIDF('text')
counts = fx.BagOfWords('text')
text_part = compose.Select('text') | (tfidf + counts)
num_part = compose.Select('a', 'b') | fx.PolynomialExtender()
model = text_part + num_part
model |= preprocessing.StandardScaler()
model |= linear_model.LinearRegression()
The following shows an example of using debug_one
to visualize how the information
flows and changes throughout the pipeline.
from river import compose
from river import naive_bayes
dataset = [
('A positive comment', True),
('A negative comment', False),
('A happy comment', True),
('A lovely comment', True),
('A harsh comment', False)
]
tfidf = fx.TFIDF() | compose.Prefixer('tfidf_')
counts = fx.BagOfWords() | compose.Prefixer('count_')
mnb = naive_bayes.MultinomialNB()
model = (tfidf + counts) | mnb
for x, y in dataset:
model.learn_one(x, y)
x = dataset[0][0]
report = model.debug_one(dataset[0][0])
print(report)
0. Input
--------
A positive comment
1. Transformer union
--------------------
1.0 TFIDF | Prefixer
--------------------
tfidf_comment: 0.43017 (float)
tfidf_positive: 0.90275 (float)
1.1 BagOfWords | Prefixer
-------------------------
count_comment: 1 (int)
count_positive: 1 (int)
count_comment: 1 (int)
count_positive: 1 (int)
tfidf_comment: 0.43017 (float)
tfidf_positive: 0.90275 (float)
2. MultinomialNB
----------------
False: 0.19221
True: 0.80779
Methods¶
debug_one
Displays the state of a set of features as it goes through the pipeline.
Parameters
- x — 'dict'
- show_types — defaults to
True
- n_decimals — defaults to
5
forecast
Return a forecast.
Only works if each estimator has a transform_one
method and the final estimator has a forecast
method. This is the case of time series models from the time_series
module.
Parameters
- horizon — 'int'
- xs — 'list[dict] | None' — defaults to
None
learn_many
Fit to a mini-batch.
Parameters
- X — 'pd.DataFrame'
- y — 'pd.Series | None' — defaults to
None
- params
learn_one
Fit to a single instance.
Parameters
- x — 'dict'
- y — defaults to
None
- params
predict_many
Call transform_many, and then predict_many on the final step.
Parameters
- X — 'pd.DataFrame'
predict_one
Call transform_one
on the first steps and predict_one
on the last step.
Parameters
- x — 'dict'
- params
predict_proba_many
Call transform_many, and then predict_proba_many on the final step.
Parameters
- X — 'pd.DataFrame'
predict_proba_one
Call transform_one
on the first steps and predict_proba_one
on the last step.
Parameters
- x — 'dict'
- params
score_one
Call transform_one
on the first steps and score_one
on the last step.
Parameters
- x — 'dict'
- params
transform_many
Apply each transformer in the pipeline to some features.
The final step in the pipeline will be applied if it is a transformer. If not, then it will be ignored and the output from the penultimate step will be returned. Note that the steps that precede the final step are assumed to all be transformers.
Parameters
- X — 'pd.DataFrame'
transform_one
Apply each transformer in the pipeline to some features.
The final step in the pipeline will be applied if it is a transformer. If not, then it will be ignored and the output from the penultimate step will be returned. Note that the steps that precede the final step are assumed to all be transformers.
Parameters
- x — 'dict'
- params