StatImputer
Replaces missing values with a statistic.
This transformer allows you to replace missing values with the value of a running statistic. During a call to learn_one, for each feature, a statistic is updated whenever a numeric feature is observed. When transform_one is called, each feature with a None value is replaced with the current value of the corresponding statistic.
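As a rough illustration of the learn/transform cycle described above, here is a minimal pure-Python sketch. It is not River's implementation: `RunningMean` and `SketchImputer` are hypothetical names introduced for this example, and the sketch assumes a statistic is updated whenever the observed value is not None.

```python
class RunningMean:
    """Hypothetical stand-in for a running statistic (not river's stats.Mean)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental mean update: no past values need to be stored.
        self.n += 1
        self.mean += (value - self.mean) / self.n

    def get(self):
        return self.mean


class SketchImputer:
    """Illustrative imputer that keeps one running statistic per feature."""

    def __init__(self, **stats):
        self.stats = stats

    def learn_one(self, x):
        # Update a feature's statistic only when a value is present.
        for name, stat in self.stats.items():
            value = x.get(name)
            if value is not None:
                stat.update(value)
        return self

    def transform_one(self, x):
        # Copy the input and fill each missing value with the current statistic.
        out = dict(x)
        for name, stat in self.stats.items():
            if out.get(name) is None:
                out[name] = stat.get()
        return out


imp = SketchImputer(temperature=RunningMean())
rows = [
    {'temperature': 1},
    {'temperature': 8},
    {'temperature': 3},
    {'temperature': None},
]
results = [imp.learn_one(x).transform_one(x) for x in rows]
print(results[-1])  # {'temperature': 4.0}
```

The missing value in the last row is filled with the mean of the three values seen before it, and observed values pass through unchanged.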
Parameters
- imputers
  A list of tuples where each tuple has two elements. The first element is a feature name and the second is an instance of stats.Univariate. The second element can also be an arbitrary value, such as -1, in which case missing values are replaced with that value.
Examples
>>> from river import preprocessing
>>> from river import stats
For numeric data, we can use a stats.Mean() to replace missing values by the running average of the previously seen values:
>>> X = [
...     {'temperature': 1},
...     {'temperature': 8},
...     {'temperature': 3},
...     {'temperature': None},
...     {'temperature': 4}
... ]
>>> imp = preprocessing.StatImputer(('temperature', stats.Mean()))
>>> for x in X:
...     imp = imp.learn_one(x)
...     print(imp.transform_one(x))
{'temperature': 1}
{'temperature': 8}
{'temperature': 3}
{'temperature': 4.0}
{'temperature': 4}
For discrete/categorical data, a common practice is to use stats.Mode to replace missing values by the most commonly seen value:
>>> X = [
...     {'weather': 'sunny'},
...     {'weather': 'rainy'},
...     {'weather': 'sunny'},
...     {'weather': None},
...     {'weather': 'rainy'},
...     {'weather': 'rainy'},
...     {'weather': None}
... ]
>>> imp = preprocessing.StatImputer(('weather', stats.Mode()))
>>> for x in X:
...     imp = imp.learn_one(x)
...     print(imp.transform_one(x))
{'weather': 'sunny'}
{'weather': 'rainy'}
{'weather': 'sunny'}
{'weather': 'sunny'}
{'weather': 'rainy'}
{'weather': 'rainy'}
{'weather': 'rainy'}
You can also choose to replace missing values with a constant value, like so:
>>> imp = preprocessing.StatImputer(('weather', 'missing'))
>>> for x in X:
...     imp = imp.learn_one(x)
...     print(imp.transform_one(x))
{'weather': 'sunny'}
{'weather': 'rainy'}
{'weather': 'sunny'}
{'weather': 'missing'}
{'weather': 'rainy'}
{'weather': 'rainy'}
{'weather': 'missing'}
Multiple imputers can be defined by providing one tuple for each feature you want to impute:
>>> X = [
...     {'weather': 'sunny', 'temperature': 8},
...     {'weather': 'rainy', 'temperature': 3},
...     {'weather': 'sunny', 'temperature': None},
...     {'weather': None, 'temperature': 4},
...     {'weather': 'snowy', 'temperature': -4},
...     {'weather': 'snowy', 'temperature': -3},
...     {'weather': 'snowy', 'temperature': -3},
...     {'weather': None, 'temperature': None}
... ]
>>> imp = preprocessing.StatImputer(
...     ('temperature', stats.Mean()),
...     ('weather', stats.Mode())
... )
>>> for x in X:
...     imp = imp.learn_one(x)
...     print(imp.transform_one(x))
{'weather': 'sunny', 'temperature': 8}
{'weather': 'rainy', 'temperature': 3}
{'weather': 'sunny', 'temperature': 5.5}
{'weather': 'sunny', 'temperature': 4}
{'weather': 'snowy', 'temperature': -4}
{'weather': 'snowy', 'temperature': -3}
{'weather': 'snowy', 'temperature': -3}
{'weather': 'snowy', 'temperature': 0.8333}
A more sophisticated way to go about imputation is to condition the statistics on a given feature. For instance, we might want to replace a missing temperature with the average temperature of a particular weather condition. As an example, consider the following dataset where the temperature is sometimes missing, but the weather condition never is:
>>> X = [
...     {'weather': 'sunny', 'temperature': 8},
...     {'weather': 'rainy', 'temperature': 3},
...     {'weather': 'sunny', 'temperature': None},
...     {'weather': 'rainy', 'temperature': 4},
...     {'weather': 'sunny', 'temperature': 10},
...     {'weather': 'sunny', 'temperature': None},
...     {'weather': 'sunny', 'temperature': 12},
...     {'weather': 'rainy', 'temperature': None}
... ]
Each missing temperature can be replaced with the average temperature of the corresponding weather condition like so:
>>> from river import compose
>>> imp = compose.Grouper(
...     preprocessing.StatImputer(('temperature', stats.Mean())),
...     by='weather'
... )
>>> for x in X:
...     imp = imp.learn_one(x)
...     print(imp.transform_one(x))
{'weather': 'sunny', 'temperature': 8}
{'weather': 'rainy', 'temperature': 3}
{'weather': 'sunny', 'temperature': 8.0}
{'weather': 'rainy', 'temperature': 4}
{'weather': 'sunny', 'temperature': 10}
{'weather': 'sunny', 'temperature': 9.0}
{'weather': 'sunny', 'temperature': 12}
{'weather': 'rainy', 'temperature': 3.5}
Note that you can also create a Grouper with the * operator:
>>> imp = preprocessing.StatImputer(('temperature', stats.Mean())) * 'weather'
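Conceptually, the grouped imputation above amounts to keeping a separate running mean for each weather value. The following pure-Python sketch shows that idea only; it does not use or reproduce compose.Grouper, and `state` and `learn_and_impute` are hypothetical names introduced for illustration.

```python
from collections import defaultdict

# Hypothetical per-group state: [count, mean] for each value of the grouping feature.
state = defaultdict(lambda: [0, 0.0])


def learn_and_impute(x, by='weather', feature='temperature'):
    """Update the running mean of x[by]'s group, or impute from it if missing."""
    count_mean = state[x[by]]
    value = x[feature]
    if value is not None:
        # Incremental mean update, restricted to this group.
        count_mean[0] += 1
        count_mean[1] += (value - count_mean[1]) / count_mean[0]
        return dict(x)
    # Missing value: fill it with the mean of the matching group only.
    return {**x, feature: count_mean[1]}


rows = [
    {'weather': 'sunny', 'temperature': 8},
    {'weather': 'rainy', 'temperature': 3},
    {'weather': 'sunny', 'temperature': None},
    {'weather': 'rainy', 'temperature': 4},
    {'weather': 'rainy', 'temperature': None},
]
imputed = [learn_and_impute(x) for x in rows]
print(imputed[2])  # {'weather': 'sunny', 'temperature': 8.0}
print(imputed[4])  # {'weather': 'rainy', 'temperature': 3.5}
```

Because each group has its own state, the sunny and rainy temperatures never influence each other's imputed values.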
Methods
clone
Return a fresh estimator with the same parameters.
The clone has the same parameters but has not been updated with any data. This works by looking at the parameters from the class signature. Each parameter is either:
- recursively cloned if it is a River class.
- deep-copied via copy.deepcopy if not.
If the calling object is stochastic (i.e. it accepts a seed parameter) and has not been seeded, then the clone will not be idempotent. Indeed, this method's purpose is simply to return a new instance with the same input parameters.
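The signature-based approach described above can be sketched in plain Python. This is an assumption-laden illustration, not River's code: `Toy` and `clone_sketch` are hypothetical, the sketch assumes each constructor argument is stored under an attribute of the same name, and it leaves out the recursive cloning of nested estimators.

```python
import copy
import inspect


class Toy:
    """Hypothetical estimator used only for illustration."""

    def __init__(self, window=3, fill=None):
        self.window = window
        self.fill = fill if fill is not None else []
        self.seen = 0  # learned state, not a constructor parameter

    def learn_one(self, x):
        self.seen += 1
        return self


def clone_sketch(obj):
    # Read the constructor parameter names from the class signature.
    names = inspect.signature(type(obj).__init__).parameters
    # Deep-copy each parameter value so the clone shares nothing with the original.
    params = {
        name: copy.deepcopy(getattr(obj, name))
        for name in names
        if name != 'self'
    }
    # Building a new instance drops any state learned by the original.
    return type(obj)(**params)


original = Toy(window=5, fill=['missing']).learn_one({'a': 1})
fresh = clone_sketch(original)
print(fresh.window, fresh.fill, fresh.seen)  # 5 ['missing'] 0
```

The clone keeps `window` and `fill` but starts with `seen` at zero, which matches the description of a fresh estimator that has not been updated with any data.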
learn_one
Update with a set of features x.
A lot of transformers don't actually have to do anything during the learn_one step because they are stateless. For this reason, the default behavior of this function is to do nothing. Transformers that do need to do something during learn_one can override this method.
Parameters
- x (dict)
Returns
Transformer: self
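The default-versus-override behavior described for learn_one can be pictured with a small sketch. The classes below are hypothetical and do not come from River; they only assume, as documented above, that learn_one returns the transformer itself.

```python
class SketchTransformer:
    """Hypothetical base class: learn_one does nothing by default."""

    def learn_one(self, x):
        return self  # nothing to learn for stateless transformers

    def transform_one(self, x):
        raise NotImplementedError


class Uppercase(SketchTransformer):
    """Stateless: relies on the inherited no-op learn_one."""

    def transform_one(self, x):
        return {k: v.upper() for k, v in x.items()}


class LastSeenFiller(SketchTransformer):
    """Stateful: overrides learn_one to remember the last non-missing value."""

    def __init__(self):
        self.last = {}

    def learn_one(self, x):
        self.last.update({k: v for k, v in x.items() if v is not None})
        return self

    def transform_one(self, x):
        return {k: self.last.get(k) if v is None else v for k, v in x.items()}


upper_out = Uppercase().learn_one({'city': 'oslo'}).transform_one({'city': 'oslo'})
filler_out = LastSeenFiller().learn_one({'city': 'oslo'}).transform_one({'city': None})
print(upper_out)   # {'city': 'OSLO'}
print(filler_out)  # {'city': 'oslo'}
```

Returning self from learn_one is what allows the chained calls in the last lines.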
transform_one
Transform a set of features x.
Parameters
- x (dict)
Returns
dict: The transformed values.