StatImputer
Replaces missing values with a statistic.
This transformer replaces missing values with the current value of a running statistic. During a call to learn_one, the statistic attached to each feature is updated whenever a non-missing value is observed for that feature. When transform_one is called, each feature with a None value is replaced with the current value of the corresponding statistic.
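The learn/transform cycle described above can be sketched in plain Python. This is only an illustration of the idea, not River's implementation: the class name RunningMeanImputer is hypothetical, it hard-codes a running mean, and it treats an absent key like None, which is an assumption of the sketch.

```python
class RunningMeanImputer:
    """Hypothetical stand-in that mimics StatImputer with a running mean."""

    def __init__(self, feature):
        self.feature = feature
        self.n = 0        # number of non-missing values seen so far
        self.mean = 0.0   # running mean of those values

    def learn_one(self, x):
        value = x.get(self.feature)
        if value is not None:
            # Incremental mean update: past values are never stored.
            self.n += 1
            self.mean += (value - self.mean) / self.n
        return self

    def transform_one(self, x):
        # Work on a copy so the caller's dict is left untouched.
        out = dict(x)
        if out.get(self.feature) is None:
            out[self.feature] = self.mean
        return out


imp = RunningMeanImputer('temperature')
rows = [
    {'temperature': 1},
    {'temperature': 8},
    {'temperature': 3},
    {'temperature': None},
]
results = [imp.learn_one(x).transform_one(x) for x in rows]
print(results[-1])  # {'temperature': 4.0}
```

The missing value is filled with the mean of 1, 8 and 3, and the None row itself does not update the statistic.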
Parameters
- imputers
A list of tuples where each tuple has two elements. The first element is a feature name and the second is an instance of stats.base.Univariate. The second element can also be an arbitrary value, such as -1, in which case missing values are replaced with that value.
Examples
from river import preprocessing
from river import stats
For numeric data, we can use stats.Mean() to replace missing values with the running average of the previously seen values:
X = [
    {'temperature': 1},
    {'temperature': 8},
    {'temperature': 3},
    {'temperature': None},
    {'temperature': 4}
]
imp = preprocessing.StatImputer(('temperature', stats.Mean()))
for x in X:
    # learn_one updates the running statistic in place
    imp.learn_one(x)
    print(imp.transform_one(x))
{'temperature': 1}
{'temperature': 8}
{'temperature': 3}
{'temperature': 4.0}
{'temperature': 4}
For discrete/categorical data, a common practice is to use stats.Mode to replace missing values with the most commonly seen value:
X = [
    {'weather': 'sunny'},
    {'weather': 'rainy'},
    {'weather': 'sunny'},
    {'weather': None},
    {'weather': 'rainy'},
    {'weather': 'rainy'},
    {'weather': None}
]
imp = preprocessing.StatImputer(('weather', stats.Mode()))
for x in X:
    # learn_one updates the running mode in place
    imp.learn_one(x)
    print(imp.transform_one(x))
{'weather': 'sunny'}
{'weather': 'rainy'}
{'weather': 'sunny'}
{'weather': 'sunny'}
{'weather': 'rainy'}
{'weather': 'rainy'}
{'weather': 'rainy'}
You can also replace missing values with a constant value, like so:
imp = preprocessing.StatImputer(('weather', 'missing'))
for x in X:
    # with a constant there is no statistic to update, but the call is harmless
    imp.learn_one(x)
    print(imp.transform_one(x))
{'weather': 'sunny'}
{'weather': 'rainy'}
{'weather': 'sunny'}
{'weather': 'missing'}
{'weather': 'rainy'}
{'weather': 'rainy'}
{'weather': 'missing'}
Multiple imputers can be defined by providing a tuple for each feature which you want to impute:
X = [
    {'weather': 'sunny', 'temperature': 8},
    {'weather': 'rainy', 'temperature': 3},
    {'weather': 'sunny', 'temperature': None},
    {'weather': None, 'temperature': 4},
    {'weather': 'snowy', 'temperature': -4},
    {'weather': 'snowy', 'temperature': -3},
    {'weather': 'snowy', 'temperature': -3},
    {'weather': None, 'temperature': None}
]
imp = preprocessing.StatImputer(
    ('temperature', stats.Mean()),
    ('weather', stats.Mode())
)
for x in X:
    # each feature's statistic is updated independently
    imp.learn_one(x)
    print(imp.transform_one(x))
{'weather': 'sunny', 'temperature': 8}
{'weather': 'rainy', 'temperature': 3}
{'weather': 'sunny', 'temperature': 5.5}
{'weather': 'sunny', 'temperature': 4}
{'weather': 'snowy', 'temperature': -4}
{'weather': 'snowy', 'temperature': -3}
{'weather': 'snowy', 'temperature': -3}
{'weather': 'snowy', 'temperature': 0.8333}
A more sophisticated way to go about imputation is to condition the statistics on a given feature. For instance, we might want to replace a missing temperature with the average temperature of a particular weather condition. As an example, consider the following dataset where the temperature is sometimes missing, but the weather condition never is:
X = [
    {'weather': 'sunny', 'temperature': 8},
    {'weather': 'rainy', 'temperature': 3},
    {'weather': 'sunny', 'temperature': None},
    {'weather': 'rainy', 'temperature': 4},
    {'weather': 'sunny', 'temperature': 10},
    {'weather': 'sunny', 'temperature': None},
    {'weather': 'sunny', 'temperature': 12},
    {'weather': 'rainy', 'temperature': None}
]
Each missing temperature can be replaced with the average temperature of the corresponding weather condition, like so:
from river import compose
imp = compose.Grouper(
    preprocessing.StatImputer(('temperature', stats.Mean())),
    by='weather'
)
for x in X:
    # one imputer is maintained per value of 'weather'
    imp.learn_one(x)
    print(imp.transform_one(x))
{'weather': 'sunny', 'temperature': 8}
{'weather': 'rainy', 'temperature': 3}
{'weather': 'sunny', 'temperature': 8.0}
{'weather': 'rainy', 'temperature': 4}
{'weather': 'sunny', 'temperature': 10}
{'weather': 'sunny', 'temperature': 9.0}
{'weather': 'sunny', 'temperature': 12}
{'weather': 'rainy', 'temperature': 3.5}
Note that you can also create a Grouper with the * operator:
imp = preprocessing.StatImputer(('temperature', stats.Mean())) * 'weather'
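To make the idea of conditioning a statistic on another feature concrete, here is a plain-Python sketch that keeps one running mean per group. It is not how compose.Grouper is implemented: the class GroupedMeanImputer and its attributes are hypothetical, and it assumes the grouping key is always present in x.

```python
from collections import defaultdict


class GroupedMeanImputer:
    """Hypothetical sketch: one running mean per value of the `by` feature."""

    def __init__(self, feature, by):
        self.feature = feature
        self.by = by
        # Each group maps to a [count, mean] pair.
        self.stats = defaultdict(lambda: [0, 0.0])

    def learn_one(self, x):
        value = x.get(self.feature)
        if value is not None:
            stat = self.stats[x[self.by]]
            stat[0] += 1
            stat[1] += (value - stat[1]) / stat[0]
        return self

    def transform_one(self, x):
        out = dict(x)
        if out.get(self.feature) is None:
            # Fill with the running mean of this row's group only.
            out[self.feature] = self.stats[x[self.by]][1]
        return out


rows = [
    {'weather': 'sunny', 'temperature': 8},
    {'weather': 'rainy', 'temperature': 3},
    {'weather': 'sunny', 'temperature': None},
    {'weather': 'rainy', 'temperature': 4},
    {'weather': 'sunny', 'temperature': 10},
    {'weather': 'sunny', 'temperature': None},
    {'weather': 'rainy', 'temperature': None},
]
grouped = GroupedMeanImputer('temperature', by='weather')
imputed = []
for x in rows:
    out = grouped.learn_one(x).transform_one(x)
    if x['temperature'] is None:
        imputed.append(out['temperature'])
print(imputed)  # [8.0, 9.0, 3.5]
```

Each missing temperature is filled from its own weather group, so a sunny row never sees rainy temperatures. In this sketch a group that has not been seen yet falls back to 0.0, which is a simplification.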
Methods
learn_one
Update with a set of features x.
Many transformers are stateless and have nothing to do during the learn_one step. For this reason, the default behavior of this method is to do nothing. Transformers that do need to update their state during learn_one can override it.
Parameters
- x — 'dict'
Returns
Transformer: self
transform_one
Transform a set of features x.
Parameters
- x — 'dict'
Returns
dict: The transformed values.