StatImputer
Replaces missing values with a statistic.
This transformer replaces missing values with the current value of a running statistic. During a call to learn_one, the statistic attached to each feature is updated whenever a value is observed for that feature (that is, whenever the value is not None). When transform_one is called, each feature with a None value is replaced with the current value of the corresponding statistic.
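The learn/transform cycle described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not river's implementation: the names `RunningMean` and `SketchImputer` are hypothetical, and only a single feature with a running mean is handled.

```python
class RunningMean:
    """Hypothetical stand-in for a running statistic such as a mean."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental mean update: no past values need to be stored.
        self.n += 1
        self.mean += (value - self.mean) / self.n

    def get(self):
        return self.mean


class SketchImputer:
    """Hypothetical single-feature imputer; not river's implementation."""

    def __init__(self, feature, stat):
        self.feature = feature
        self.stat = stat

    def learn_one(self, x):
        # Only observed (non-None) values update the statistic.
        value = x.get(self.feature)
        if value is not None:
            self.stat.update(value)
        return self

    def transform_one(self, x):
        # Work on a copy so that the caller's dict is left untouched.
        x = dict(x)
        if x.get(self.feature) is None:
            x[self.feature] = self.stat.get()
        return x


imp = SketchImputer('temperature', RunningMean())
for x in [{'temperature': 1}, {'temperature': 8}, {'temperature': 3}]:
    imp.learn_one(x)

filled = imp.transform_one({'temperature': None})
print(filled)  # {'temperature': 4.0}
```

The statistic is kept separate from the imputer so that a different statistic (or a constant) could be swapped in without touching the imputation logic.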
Parameters
- imputers
A list of tuples, each with two elements. The first element is a feature name and the second is an instance of stats.base.Univariate. The second element can also be an arbitrary constant, such as -1, in which case missing values are replaced with that constant.
Examples
>>> from river import preprocessing
>>> from river import stats
For numeric data, we can use a stats.Mean() to replace missing values by the running average of the previously seen values:
>>> X = [
...     {'temperature': 1},
...     {'temperature': 8},
...     {'temperature': 3},
...     {'temperature': None},
...     {'temperature': 4}
... ]
>>> imp = preprocessing.StatImputer(('temperature', stats.Mean()))
>>> for x in X:
...     imp = imp.learn_one(x)
...     print(imp.transform_one(x))
{'temperature': 1}
{'temperature': 8}
{'temperature': 3}
{'temperature': 4.0}
{'temperature': 4}
For discrete/categorical data, a common practice is to use stats.Mode to replace missing values by the most commonly seen value:
>>> X = [
...     {'weather': 'sunny'},
...     {'weather': 'rainy'},
...     {'weather': 'sunny'},
...     {'weather': None},
...     {'weather': 'rainy'},
...     {'weather': 'rainy'},
...     {'weather': None}
... ]
>>> imp = preprocessing.StatImputer(('weather', stats.Mode()))
>>> for x in X:
...     imp = imp.learn_one(x)
...     print(imp.transform_one(x))
{'weather': 'sunny'}
{'weather': 'rainy'}
{'weather': 'sunny'}
{'weather': 'sunny'}
{'weather': 'rainy'}
{'weather': 'rainy'}
{'weather': 'rainy'}
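To make the idea of a running mode concrete, here is a small sketch built on `collections.Counter`. It is an illustration only: the helper `running_modes` is hypothetical, and the way ties are broken here follows `Counter.most_common`, which is an assumption of this sketch and may differ from what stats.Mode does.

```python
from collections import Counter


def running_modes(values):
    """Yield the most frequent non-missing value seen so far after each item."""
    counts = Counter()
    for value in values:
        if value is not None:
            counts[value] += 1
        # most_common(1) is empty until a first value has been observed.
        top = counts.most_common(1)
        yield top[0][0] if top else None


weather = ['sunny', 'rainy', 'sunny', None, 'rainy', 'rainy', None]
modes = list(running_modes(weather))
print(modes)
```

A missing value leaves the counts unchanged, so the mode reported at a missing position is the one established by the earlier observations.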
You can also choose to replace missing values with a constant value, like so:
>>> imp = preprocessing.StatImputer(('weather', 'missing'))
>>> for x in X:
...     imp = imp.learn_one(x)
...     print(imp.transform_one(x))
{'weather': 'sunny'}
{'weather': 'rainy'}
{'weather': 'sunny'}
{'weather': 'missing'}
{'weather': 'rainy'}
{'weather': 'rainy'}
{'weather': 'missing'}
Multiple imputers can be defined by providing a tuple for each feature which you want to impute:
>>> X = [
...     {'weather': 'sunny', 'temperature': 8},
...     {'weather': 'rainy', 'temperature': 3},
...     {'weather': 'sunny', 'temperature': None},
...     {'weather': None, 'temperature': 4},
...     {'weather': 'snowy', 'temperature': -4},
...     {'weather': 'snowy', 'temperature': -3},
...     {'weather': 'snowy', 'temperature': -3},
...     {'weather': None, 'temperature': None}
... ]
>>> imp = preprocessing.StatImputer(
...     ('temperature', stats.Mean()),
...     ('weather', stats.Mode())
... )
>>> for x in X:
...     imp = imp.learn_one(x)
...     print(imp.transform_one(x))
{'weather': 'sunny', 'temperature': 8}
{'weather': 'rainy', 'temperature': 3}
{'weather': 'sunny', 'temperature': 5.5}
{'weather': 'sunny', 'temperature': 4}
{'weather': 'snowy', 'temperature': -4}
{'weather': 'snowy', 'temperature': -3}
{'weather': 'snowy', 'temperature': -3}
{'weather': 'snowy', 'temperature': 0.8333}
A more sophisticated way to go about imputation is to condition the statistics on a given feature. For instance, we might want to replace a missing temperature with the average temperature of a particular weather condition. As an example, consider the following dataset, where some temperatures are missing but the weather condition is always known:
>>> X = [
...     {'weather': 'sunny', 'temperature': 8},
...     {'weather': 'rainy', 'temperature': 3},
...     {'weather': 'sunny', 'temperature': None},
...     {'weather': 'rainy', 'temperature': 4},
...     {'weather': 'sunny', 'temperature': 10},
...     {'weather': 'sunny', 'temperature': None},
...     {'weather': 'sunny', 'temperature': 12},
...     {'weather': 'rainy', 'temperature': None}
... ]
Each missing temperature can be replaced with the average temperature of the corresponding weather condition, like so:
>>> from river import compose
>>> imp = compose.Grouper(
...     preprocessing.StatImputer(('temperature', stats.Mean())),
...     by='weather'
... )
>>> for x in X:
...     imp = imp.learn_one(x)
...     print(imp.transform_one(x))
{'weather': 'sunny', 'temperature': 8}
{'weather': 'rainy', 'temperature': 3}
{'weather': 'sunny', 'temperature': 8.0}
{'weather': 'rainy', 'temperature': 4}
{'weather': 'sunny', 'temperature': 10}
{'weather': 'sunny', 'temperature': 9.0}
{'weather': 'sunny', 'temperature': 12}
{'weather': 'rainy', 'temperature': 3.5}
Note that you can also create a Grouper with the * operator:
>>> imp = preprocessing.StatImputer(('temperature', stats.Mean())) * 'weather'
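Conceptually, conditioning the imputation on a feature amounts to keeping one running statistic per distinct value of that feature. The following plain-Python sketch illustrates the idea; `GroupedMeanImputer` is a hypothetical name and this is not how compose.Grouper is implemented.

```python
from collections import defaultdict


class GroupedMeanImputer:
    """Hypothetical sketch: one running mean per value of a grouping feature."""

    def __init__(self, feature, by):
        self.feature = feature
        self.by = by
        self.counts = defaultdict(int)
        self.means = defaultdict(float)

    def learn_one(self, x):
        value = x[self.feature]
        if value is not None:
            # Update only the mean that belongs to this sample's group.
            group = x[self.by]
            self.counts[group] += 1
            self.means[group] += (value - self.means[group]) / self.counts[group]
        return self

    def transform_one(self, x):
        x = dict(x)
        if x[self.feature] is None:
            # In this sketch, a group with no observed value falls back to 0.0.
            x[self.feature] = self.means[x[self.by]]
        return x


X = [
    {'weather': 'sunny', 'temperature': 8},
    {'weather': 'rainy', 'temperature': 3},
    {'weather': 'sunny', 'temperature': None},
    {'weather': 'rainy', 'temperature': 4},
]

imp = GroupedMeanImputer('temperature', by='weather')
results = [imp.learn_one(x).transform_one(x) for x in X]
print(results[2])  # {'weather': 'sunny', 'temperature': 8.0}
```

The missing sunny temperature is filled with the sunny mean only; the rainy observations have no influence on it.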
Methods
learn_one
Update with a set of features x.
Many transformers are stateless and therefore have nothing to do during the learn_one step. For this reason, the default behavior of this method is to do nothing. Transformers that do need to update their state during learn_one can override it.
Parameters
- x (dict)
Returns
Transformer: self
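The "do nothing by default, override when stateful" pattern described for learn_one can be sketched as follows. All class names here (`SketchTransformer`, `Doubler`, `MaxTracker`) are hypothetical and serve only to illustrate the pattern; they are not part of river.

```python
class SketchTransformer:
    """Hypothetical base class mirroring the behavior described above."""

    def learn_one(self, x):
        # Stateless by default: nothing to update, just return self.
        return self

    def transform_one(self, x):
        raise NotImplementedError


class Doubler(SketchTransformer):
    """Stateless transformer: inherits the no-op learn_one."""

    def transform_one(self, x):
        return {k: 2 * v for k, v in x.items()}


class MaxTracker(SketchTransformer):
    """Stateful transformer: overrides learn_one to track per-feature maxima."""

    def __init__(self):
        self.maxima = {}

    def learn_one(self, x):
        for k, v in x.items():
            self.maxima[k] = max(v, self.maxima.get(k, v))
        return self

    def transform_one(self, x):
        # Express each value as its distance below the largest value seen so far.
        return {k: self.maxima.get(k, v) - v for k, v in x.items()}


doubled = Doubler().learn_one({'a': 1}).transform_one({'a': 3})
tracker = MaxTracker().learn_one({'a': 5}).learn_one({'a': 2})
gap = tracker.transform_one({'a': 3})
print(doubled, gap)  # {'a': 6} {'a': 2}
```

Because learn_one returns self in both cases, calls can be chained regardless of whether a transformer keeps any state.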
transform_one
Transform a set of features x.
Parameters
- x (dict)
Returns
dict: The transformed values.