TransformerUnion¶
Packs multiple transformers into a single one.
Pipelines allow you to apply steps sequentially: the output of one step becomes the input of the next. In many cases, however, you may want to pass the output of a step to several steps at once. This simple transformer allows you to do so. In other words, it enables you to apply particular steps to different parts of an input. A typical example is when you want to scale numeric features and one-hot encode categorical features.
This transformer is essentially a list of transformers. Whenever it is updated, it loops through each transformer and updates it. Likewise, calling transform_one collects the output of each transformer and merges the results into a single dictionary.
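To make the merging concrete, here is a minimal sketch of the idea behind transform_one. Note that union_transform_one is a hypothetical helper written for illustration, not part of River:

>>> def union_transform_one(transformers, x):
...     # Illustration only, not River's actual implementation: run each
...     # transformer on x and merge the resulting dictionaries into one.
...     out = {}
...     for transformer in transformers:
...         out.update(transformer.transform_one(x))
...     return out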
Parameters¶
- transformers – Ideally, a list of (name, estimator) tuples. A name is automatically inferred if none is provided.
Examples¶
Take the following dataset:
>>> X = [
... {'place': 'Taco Bell', 'revenue': 42},
... {'place': 'Burger King', 'revenue': 16},
... {'place': 'Burger King', 'revenue': 24},
... {'place': 'Taco Bell', 'revenue': 58},
... {'place': 'Burger King', 'revenue': 20},
... {'place': 'Taco Bell', 'revenue': 50}
... ]
As an example, let's assume we want to compute two aggregates of this dataset. We therefore define two feature_extraction.Agg instances and initialize a TransformerUnion with them:
>>> from river import compose
>>> from river import feature_extraction
>>> from river import stats
>>> mean = feature_extraction.Agg(
... on='revenue', by='place',
... how=stats.Mean()
... )
>>> count = feature_extraction.Agg(
... on='revenue', by='place',
... how=stats.Count()
... )
>>> agg = compose.TransformerUnion(mean, count)
We can now update every transformer at once: a single call to learn_one updates each of them, while transform_one merges all of their outputs:
>>> from pprint import pprint
>>> for x in X:
... agg = agg.learn_one(x)
... pprint(agg.transform_one(x))
{'revenue_count_by_place': 1, 'revenue_mean_by_place': 42.0}
{'revenue_count_by_place': 1, 'revenue_mean_by_place': 16.0}
{'revenue_count_by_place': 2, 'revenue_mean_by_place': 20.0}
{'revenue_count_by_place': 2, 'revenue_mean_by_place': 50.0}
{'revenue_count_by_place': 3, 'revenue_mean_by_place': 20.0}
{'revenue_count_by_place': 3, 'revenue_mean_by_place': 50.0}
Note that you can use the + operator as a shorthand notation:

>>> agg = mean + count
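Since the shorthand is meant to be equivalent to calling the constructor, the result should indeed be a TransformerUnion:

>>> isinstance(agg, compose.TransformerUnion)
True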
This allows you to build complex pipelines in a very terse manner. For instance, we can create a pipeline that scales each feature and fits a logistic regression like so:
>>> from river import linear_model as lm
>>> from river import preprocessing as pp
>>> model = (
... (mean + count) |
... pp.StandardScaler() |
... lm.LogisticRegression()
... )
Which is equivalent to the following code:
>>> model = compose.Pipeline(
... compose.TransformerUnion(mean, count),
... pp.StandardScaler(),
... lm.LogisticRegression()
... )
Note that you can access any part of a TransformerUnion by name:
>>> model['TransformerUnion']['Agg']
Agg (
on="revenue"
by=['place']
how=Mean ()
)
>>> model['TransformerUnion']['Agg1']
Agg (
on="revenue"
by=['place']
how=Count ()
)
You can also manually provide a name for each step:
>>> agg = compose.TransformerUnion(
... ('Mean revenue by place', mean),
... ('# by place', count)
... )
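These custom names can then be used for access, just like the inferred names above. As a sketch (the assignment itself produces no output):

>>> count_by_place = agg['# by place']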
Mini-batch example:

>>> import pandas as pd
>>> X = pd.DataFrame([
... {"place": 2, "revenue": 42},
... {"place": 3, "revenue": 16},
... {"place": 3, "revenue": 24},
... {"place": 2, "revenue": 58},
... {"place": 3, "revenue": 20},
... {"place": 2, "revenue": 50},
... ])
Since we need a transformer with mini-batch support for this demonstration, we will use a StandardScaler.
>>> from river import compose
>>> from river import preprocessing
>>> agg = (
... compose.Select("place") +
... (compose.Select("revenue") | preprocessing.StandardScaler())
... )
>>> _ = agg.learn_many(X)
>>> agg.transform_many(X)
place revenue
0 2 0.441250
1 3 -1.197680
2 3 -0.693394
3 2 1.449823
4 3 -0.945537
5 2 0.945537
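As a sanity check, the scaled revenue column matches a manual standardization using the population standard deviation (ddof=0), which suggests that is what the scaler uses internally. Here the revenue mean is 35 and the population standard deviation is roughly 15.86:

>>> (X['revenue'] - X['revenue'].mean()) / X['revenue'].std(ddof=0)
0    0.441250
1   -1.197680
2   -0.693394
3    1.449823
4   -0.945537
5    0.945537
Name: revenue, dtype: float64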
Methods¶
clone
Return a fresh estimator with the same parameters.
The clone has the same parameters but has not been updated with any data. This works by looking at the parameters from the class signature. Each parameter is either recursively cloned if it is a River class, or deep-copied via copy.deepcopy if not. If the calling object is stochastic (i.e. it accepts a seed parameter) and has not been seeded, then the clone will not be idempotent. Indeed, this method's purpose is simply to return a new instance with the same input parameters.
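For instance, a fresh, unfitted copy of the union defined earlier can be obtained like so (the assignment itself produces no output):

>>> fresh_agg = agg.clone()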
learn_many
Update each transformer.
Parameters
- X (pandas.core.frame.DataFrame)
- y (pandas.core.series.Series) – defaults to None
learn_one
Update each transformer.
Parameters
- x (dict)
- y – defaults to None
transform_many
Passes the data through each transformer and packs the results together.
Parameters
- X (pandas.core.frame.DataFrame)
transform_one
Passes the data through each transformer and packs the results together.
Parameters
- x (dict)