StatImputer

Replaces missing values with a statistic.

This transformer replaces missing values with the current value of a running statistic. During a call to learn_one, the statistic attached to each feature is updated whenever a non-missing value is observed for that feature. When transform_one is called, each feature whose value is None is replaced with the current value of the corresponding statistic.
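The learn/transform cycle described above can be pictured with a small pure-Python sketch. It is an illustration only, not River's implementation: RunningMean and SketchImputer are hypothetical names, and RunningMean merely stands in for a statistic such as stats.Mean.

```python
class RunningMean:
    """Incrementally updated mean (hypothetical stand-in for a running statistic)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, value):
        self.n += 1
        self.mean += (value - self.mean) / self.n

    def get(self):
        return self.mean


class SketchImputer:
    """Toy imputer (hypothetical): one running statistic per feature name."""

    def __init__(self, *imputers):
        # Each imputer is a (feature name, statistic) tuple.
        self.stats = dict(imputers)

    def learn_one(self, x):
        for name, stat in self.stats.items():
            value = x.get(name)
            if value is not None:  # missing values never update the statistic
                stat.update(value)
        return self

    def transform_one(self, x):
        x = dict(x)  # work on a copy so the caller's dict is left untouched
        for name, stat in self.stats.items():
            if x.get(name) is None:
                x[name] = stat.get()
        return x


imp = SketchImputer(("temperature", RunningMean()))
rows = [
    {"temperature": 1},
    {"temperature": 8},
    {"temperature": 3},
    {"temperature": None},
]
outputs = [imp.learn_one(x).transform_one(x) for x in rows]
print(outputs[-1])
```

The last row is filled with the mean of the three observed values, and the input dictionaries themselves are not modified.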

Parameters

imputers

One or more tuples, each with two elements. The first element is a feature name and the second is an instance of stats.base.Univariate. The second element can also be an arbitrary constant, such as -1, in which case missing values are replaced with that constant.

Examples

>>> from river import preprocessing
>>> from river import stats

For numeric data, we can use a stats.Mean() to replace missing values by the running average of the previously seen values:

>>> X = [
...     {'temperature': 1},
...     {'temperature': 8},
...     {'temperature': 3},
...     {'temperature': None},
...     {'temperature': 4}
... ]

>>> imp = preprocessing.StatImputer(('temperature', stats.Mean()))

>>> for x in X:
...     imp = imp.learn_one(x)
...     print(imp.transform_one(x))
{'temperature': 1}
{'temperature': 8}
{'temperature': 3}
{'temperature': 4.0}
{'temperature': 4}

For discrete/categorical data, a common practice is to use stats.Mode to replace missing values with the most commonly seen value:

>>> X = [
...     {'weather': 'sunny'},
...     {'weather': 'rainy'},
...     {'weather': 'sunny'},
...     {'weather': None},
...     {'weather': 'rainy'},
...     {'weather': 'rainy'},
...     {'weather': None}
... ]

>>> imp = preprocessing.StatImputer(('weather', stats.Mode()))

>>> for x in X:
...     imp = imp.learn_one(x)
...     print(imp.transform_one(x))
{'weather': 'sunny'}
{'weather': 'rainy'}
{'weather': 'sunny'}
{'weather': 'sunny'}
{'weather': 'rainy'}
{'weather': 'rainy'}
{'weather': 'rainy'}

You can also choose to replace missing values with a constant value, like so:

>>> imp = preprocessing.StatImputer(('weather', 'missing'))

>>> for x in X:
...     imp = imp.learn_one(x)
...     print(imp.transform_one(x))
{'weather': 'sunny'}
{'weather': 'rainy'}
{'weather': 'sunny'}
{'weather': 'missing'}
{'weather': 'rainy'}
{'weather': 'rainy'}
{'weather': 'missing'}

Multiple imputers can be defined by providing a tuple for each feature which you want to impute:

>>> X = [
...     {'weather': 'sunny', 'temperature': 8},
...     {'weather': 'rainy', 'temperature': 3},
...     {'weather': 'sunny', 'temperature': None},
...     {'weather': None, 'temperature': 4},
...     {'weather': 'snowy', 'temperature': -4},
...     {'weather': 'snowy', 'temperature': -3},
...     {'weather': 'snowy', 'temperature': -3},
...     {'weather': None, 'temperature': None}
... ]

>>> imp = preprocessing.StatImputer(
...     ('temperature', stats.Mean()),
...     ('weather', stats.Mode())
... )

>>> for x in X:
...     imp = imp.learn_one(x)
...     print(imp.transform_one(x))
{'weather': 'sunny', 'temperature': 8}
{'weather': 'rainy', 'temperature': 3}
{'weather': 'sunny', 'temperature': 5.5}
{'weather': 'sunny', 'temperature': 4}
{'weather': 'snowy', 'temperature': -4}
{'weather': 'snowy', 'temperature': -3}
{'weather': 'snowy', 'temperature': -3}
{'weather': 'snowy', 'temperature': 0.8333}

A more sophisticated way to go about imputation is to condition the statistics on a given feature. For instance, we might want to replace a missing temperature with the average temperature of a particular weather condition. As an example, consider the following dataset where the temperature is sometimes missing, but the weather condition never is:

>>> X = [
...     {'weather': 'sunny', 'temperature': 8},
...     {'weather': 'rainy', 'temperature': 3},
...     {'weather': 'sunny', 'temperature': None},
...     {'weather': 'rainy', 'temperature': 4},
...     {'weather': 'sunny', 'temperature': 10},
...     {'weather': 'sunny', 'temperature': None},
...     {'weather': 'sunny', 'temperature': 12},
...     {'weather': 'rainy', 'temperature': None}
... ]

Each missing temperature can be replaced with the average temperature of the corresponding weather condition like so:

>>> from river import compose

>>> imp = compose.Grouper(
...     preprocessing.StatImputer(('temperature', stats.Mean())),
...     by='weather'
... )

>>> for x in X:
...     imp = imp.learn_one(x)
...     print(imp.transform_one(x))
{'weather': 'sunny', 'temperature': 8}
{'weather': 'rainy', 'temperature': 3}
{'weather': 'sunny', 'temperature': 8.0}
{'weather': 'rainy', 'temperature': 4}
{'weather': 'sunny', 'temperature': 10}
{'weather': 'sunny', 'temperature': 9.0}
{'weather': 'sunny', 'temperature': 12}
{'weather': 'rainy', 'temperature': 3.5}

Note that you can also create a Grouper with the * operator:

>>> imp = preprocessing.StatImputer(('temperature', stats.Mean())) * 'weather'
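To make the idea of conditioning on a feature concrete, here is a minimal pure-Python sketch that keeps one running mean per group. It does not reproduce how compose.Grouper works internally; GroupedMeanImputer is a hypothetical class, and leaving the value as None when a group has no observations yet is an assumption of this sketch.

```python
from collections import defaultdict


class GroupedMeanImputer:
    """Toy imputer (hypothetical): one running mean of `feature` per value of `by`."""

    def __init__(self, feature, by):
        self.feature = feature
        self.by = by
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def learn_one(self, x):
        value = x.get(self.feature)
        if value is not None:
            key = x[self.by]  # the group this observation belongs to
            self.sums[key] += value
            self.counts[key] += 1
        return self

    def transform_one(self, x):
        x = dict(x)  # copy so the input is not mutated
        if x.get(self.feature) is None:
            key = x[self.by]
            # Assumption: keep None if nothing has been seen for this group yet.
            if self.counts.get(key):
                x[self.feature] = self.sums[key] / self.counts[key]
        return x


X = [
    {"weather": "sunny", "temperature": 8},
    {"weather": "rainy", "temperature": 3},
    {"weather": "sunny", "temperature": None},
    {"weather": "rainy", "temperature": 4},
    {"weather": "sunny", "temperature": 10},
    {"weather": "sunny", "temperature": None},
    {"weather": "sunny", "temperature": 12},
    {"weather": "rainy", "temperature": None},
]

imp = GroupedMeanImputer("temperature", by="weather")
imputed = [imp.learn_one(x).transform_one(x)["temperature"] for x in X]
print(imputed)
```

Each group only ever sees its own observations, which is why the missing sunny temperatures are filled from sunny rows and the missing rainy temperature from rainy rows.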

Methods

clone

Return a fresh estimator with the same parameters.

The clone has the same parameters but has not been updated with any data. This works by looking at the parameters from the class signature. Each parameter is either recursively cloned, if it is a River class, or deep-copied via copy.deepcopy otherwise. If the calling object is stochastic (i.e. it accepts a seed parameter) and has not been seeded, then the clone will not be idempotent. Indeed, this method's purpose is simply to return a new instance with the same input parameters.
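The cloning strategy described above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not River's code: clone_from_signature and MeanTracker are hypothetical, every constructor argument is assumed to be stored as an attribute of the same name, and variadic arguments are not handled.

```python
import copy
import inspect


def clone_from_signature(obj):
    """Rebuild obj from its constructor arguments, dropping any learned state."""
    params = {}
    for name in inspect.signature(type(obj).__init__).parameters:
        if name == "self":
            continue
        # Assumption: each argument is stored on the instance under its own name.
        value = getattr(obj, name)
        if hasattr(value, "clone"):
            # Nested objects that can clone themselves are cloned recursively.
            params[name] = value.clone()
        else:
            # Everything else is deep-copied.
            params[name] = copy.deepcopy(value)
    return type(obj)(**params)


class MeanTracker:
    """Hypothetical estimator with one parameter and some learned state."""

    def __init__(self, fill_value=0.0):
        self.fill_value = fill_value  # constructor parameter: survives cloning
        self.n = 0  # learned state: reset by cloning
        self.total = 0.0

    def update(self, value):
        self.n += 1
        self.total += value
        return self

    def clone(self):
        return clone_from_signature(self)


tracker = MeanTracker(fill_value=-1.0).update(3).update(5)
fresh = tracker.clone()
print(fresh.fill_value, fresh.n, tracker.n)
```

The clone keeps fill_value but starts with n equal to zero, while the original object keeps its two observations.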

learn_one

Update with a set of features x.

Many transformers do not have to do anything during the learn_one step because they are stateless. For this reason, the default behavior of this method is to do nothing. Transformers that do need to update some state during learn_one can override it.

Parameters

x (dict)

Returns

Transformer: self
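The difference between a stateless transformer that relies on the default learn_one and a stateful one that overrides it can be illustrated with a short sketch. BaseTransformer, Doubler and MaxAbsScalerSketch are hypothetical classes written for this example and do not mirror River's class hierarchy.

```python
class BaseTransformer:
    """Hypothetical base class whose learn_one does nothing by default."""

    def learn_one(self, x):
        return self  # nothing to learn

    def transform_one(self, x):
        raise NotImplementedError


class Doubler(BaseTransformer):
    """Stateless: inherits the no-op learn_one."""

    def transform_one(self, x):
        return {k: 2 * v for k, v in x.items()}


class MaxAbsScalerSketch(BaseTransformer):
    """Stateful: overrides learn_one to track the largest absolute value per feature."""

    def __init__(self):
        self.max_abs = {}

    def learn_one(self, x):
        for k, v in x.items():
            self.max_abs[k] = max(self.max_abs.get(k, 0), abs(v))
        return self

    def transform_one(self, x):
        # Features with no recorded maximum (or a maximum of 0) are passed through.
        return {
            k: v / self.max_abs[k] if self.max_abs.get(k) else v
            for k, v in x.items()
        }


doubled = Doubler().learn_one({"a": 1}).transform_one({"a": 2})
scaled = MaxAbsScalerSketch().learn_one({"a": 4}).transform_one({"a": 2})
print(doubled, scaled)
```

Because both versions of learn_one return self, calls can be chained in the same way regardless of whether the transformer keeps any state.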

transform_one

Transform a set of features x.

Parameters

x (dict)

Returns

dict: The transformed values.