TransformerUnion¶
Packs multiple transformers into a single one.
Pipelines allow you to apply steps sequentially, so the output of a step becomes the input of the next one. In many cases, however, you may want to pass the output of a step to multiple steps at once. This simple transformer allows you to do so. In other words, it enables you to apply particular steps to different parts of an input. A typical example is when you want to scale numeric features and one-hot encode categorical features.
This transformer is essentially a list of transformers. Whenever it is updated, it loops through each transformer and updates it. Likewise, calling transform_one collects the output of each transformer and merges the results into a single dictionary.
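The update/merge behaviour described above can be sketched in a few lines of plain Python. This is a simplified illustration, not River's actual implementation; the UnionSketch, ToyCount, and ToySum classes below are made up for the example:

```python
class UnionSketch:
    """Simplified illustration of how a transformer union behaves."""

    def __init__(self, *transformers):
        self.transformers = transformers

    def learn_one(self, x):
        # Updating the union updates every member in turn
        for t in self.transformers:
            t.learn_one(x)

    def transform_one(self, x):
        # The outputs of all members are merged into one dictionary
        out = {}
        for t in self.transformers:
            out.update(t.transform_one(x))
        return out


class ToyCount:
    """Counts how many samples have been seen."""

    def __init__(self):
        self.n = 0

    def learn_one(self, x):
        self.n += 1

    def transform_one(self, x):
        return {'count': self.n}


class ToySum:
    """Keeps a running sum of the 'revenue' field."""

    def __init__(self):
        self.total = 0

    def learn_one(self, x):
        self.total += x['revenue']

    def transform_one(self, x):
        return {'revenue_sum': self.total}


union = UnionSketch(ToyCount(), ToySum())
union.learn_one({'revenue': 42})
union.learn_one({'revenue': 16})
print(union.transform_one({'revenue': 0}))  # {'count': 2, 'revenue_sum': 58}
```

The key point is that the merged dictionary contains one entry per output feature, so each member transformer should produce distinct keys.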
Parameters¶
- transformers
  Ideally, a list of (name, estimator) tuples. A name is automatically inferred if none is provided.
Examples¶
Take the following dataset:
>>> X = [
... {'place': 'Taco Bell', 'revenue': 42},
... {'place': 'Burger King', 'revenue': 16},
... {'place': 'Burger King', 'revenue': 24},
... {'place': 'Taco Bell', 'revenue': 58},
... {'place': 'Burger King', 'revenue': 20},
... {'place': 'Taco Bell', 'revenue': 50}
... ]
As an example, let's assume we want to compute two aggregates of a dataset. We therefore define two feature_extraction.Agg instances and initialize a TransformerUnion with them:
>>> from river import compose
>>> from river import feature_extraction
>>> from river import stats
>>> mean = feature_extraction.Agg(
... on='revenue', by='place',
... how=stats.Mean()
... )
>>> count = feature_extraction.Agg(
... on='revenue', by='place',
... how=stats.Count()
... )
>>> agg = compose.TransformerUnion(mean, count)
We can now update all the transformers at once and obtain their merged output:
>>> from pprint import pprint
>>> for x in X:
... agg = agg.learn_one(x)
... pprint(agg.transform_one(x))
{'revenue_count_by_place': 1, 'revenue_mean_by_place': 42.0}
{'revenue_count_by_place': 1, 'revenue_mean_by_place': 16.0}
{'revenue_count_by_place': 2, 'revenue_mean_by_place': 20.0}
{'revenue_count_by_place': 2, 'revenue_mean_by_place': 50.0}
{'revenue_count_by_place': 3, 'revenue_mean_by_place': 20.0}
{'revenue_count_by_place': 3, 'revenue_mean_by_place': 50.0}
Note that you can use the + operator as a shorthand notation:
>>> agg = mean + count
This allows you to build complex pipelines in a very terse manner. For instance, we can create a pipeline that scales each feature and fits a logistic regression like so:
>>> from river import linear_model as lm
>>> from river import preprocessing as pp
>>> model = (
... (mean + count) |
... pp.StandardScaler() |
... lm.LogisticRegression()
... )
Which is equivalent to the following code:
>>> model = compose.Pipeline(
... compose.TransformerUnion(mean, count),
... pp.StandardScaler(),
... lm.LogisticRegression()
... )
Note that you can access any part of a TransformerUnion by name:
>>> model['TransformerUnion']['Agg']
Agg (
on="revenue"
by=['place']
how=Mean ()
)
>>> model['TransformerUnion']['Agg1']
Agg (
on="revenue"
by=['place']
how=Count ()
)
You can also manually provide a name for each step:
>>> agg = compose.TransformerUnion(
... ('Mean revenue by place', mean),
... ('# by place', count)
... )
Mini-batch example:
>>> import pandas as pd
>>> X = pd.DataFrame([
... {"place": 2, "revenue": 42},
... {"place": 3, "revenue": 16},
... {"place": 3, "revenue": 24},
... {"place": 2, "revenue": 58},
... {"place": 3, "revenue": 20},
... {"place": 2, "revenue": 50},
... ])
Since we need a transformer with mini-batch support for this demonstration, we will use a StandardScaler.
>>> from river import compose
>>> from river import preprocessing
>>> agg = (
... compose.Select("place") +
... (compose.Select("revenue") | preprocessing.StandardScaler())
... )
>>> _ = agg.learn_many(X)
>>> agg.transform_many(X)
place revenue
0 2 0.441250
1 3 -1.197680
2 3 -0.693394
3 2 1.449823
4 3 -0.945537
5 2 0.945537
Methods¶
learn_many
Update each transformer.
Parameters
- X (pandas.core.frame.DataFrame)
- y (pandas.core.series.Series) – defaults to None
learn_one
Update each transformer.
Parameters
- x (dict)
- y – defaults to None
transform_many
Passes the data through each transformer and packs the results together.
Parameters
- X (pandas.core.frame.DataFrame)
transform_one
Passes the data through each transformer and packs the results together.
Parameters
- x (dict)