OneHotEncoder¶
One-hot encoding.
This transformer will encode every feature it is provided with. If a list or set is provided, it will encode every entry in the list/set. You can apply it to a subset of features by composing it with compose.Select or compose.SelectType.
Parameters¶
-
sparse
Default → False
Whether or not the output should be sparse, i.e. whether 0s should be omitted. When False, previously seen values that are absent from the current observation are included in the output as explicit 0s.
Examples¶
Let us first create an example dataset.
from pprint import pprint
import random
import string
random.seed(42)
alphabet = list(string.ascii_lowercase)
X = [
    {
        'c1': random.choice(alphabet),
        'c2': random.choice(alphabet),
    }
    for _ in range(4)
]
pprint(X)
[{'c1': 'u', 'c2': 'd'},
{'c1': 'a', 'c2': 'x'},
{'c1': 'i', 'c2': 'h'},
{'c1': 'h', 'c2': 'e'}]
We can now apply one-hot encoding. All the provided features are one-hot encoded, so there is no need to specify which features to encode.
from river import preprocessing
oh = preprocessing.OneHotEncoder(sparse=True)
for x in X:
    oh = oh.learn_one(x)
    pprint(oh.transform_one(x))
{'c1_u': 1, 'c2_d': 1}
{'c1_a': 1, 'c2_x': 1}
{'c1_i': 1, 'c2_h': 1}
{'c1_h': 1, 'c2_e': 1}
The sparse parameter can be set to False in order to include the values that are not present as explicit 0s in the output.
oh = preprocessing.OneHotEncoder(sparse=False)
for x in X[:2]:
    oh = oh.learn_one(x)
    pprint(oh.transform_one(x))
{'c1_u': 1, 'c2_d': 1}
{'c1_a': 1, 'c1_u': 0, 'c2_d': 0, 'c2_x': 1}
A subset of the features can be one-hot encoded by using an instance of compose.Select.
from river import compose
pp = compose.Select('c1') | preprocessing.OneHotEncoder()
for x in X:
    pp = pp.learn_one(x)
    pprint(pp.transform_one(x))
{'c1_u': 1}
{'c1_a': 1, 'c1_u': 0}
{'c1_a': 0, 'c1_i': 1, 'c1_u': 0}
{'c1_a': 0, 'c1_h': 1, 'c1_i': 0, 'c1_u': 0}
You can preserve the c2 feature by using a union:
pp = compose.Select('c1') | preprocessing.OneHotEncoder()
pp += compose.Select('c2')
for x in X:
    pp = pp.learn_one(x)
    pprint(pp.transform_one(x))
{'c1_u': 1, 'c2': 'd'}
{'c1_a': 1, 'c1_u': 0, 'c2': 'x'}
{'c1_a': 0, 'c1_i': 1, 'c1_u': 0, 'c2': 'h'}
{'c1_a': 0, 'c1_h': 1, 'c1_i': 0, 'c1_u': 0, 'c2': 'e'}
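As mentioned in the introduction, features can also be picked by type rather than by name with compose.SelectType. The following is a minimal sketch, assuming SelectType accepts Python's built-in str type; since both c1 and c2 hold strings, both features would get encoded here.
# Select every feature whose value is a str, then one-hot encode it.
pp = compose.SelectType(str) | preprocessing.OneHotEncoder()
for x in X:
    pp = pp.learn_one(x)
    pprint(pp.transform_one(x))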
Similar to the above examples, we can also pass values as a list. This will one-hot encode all of the entries individually.
X = [
    {'c1': ['u', 'a'], 'c2': ['d']},
    {'c1': ['a', 'b'], 'c2': ['x']},
    {'c1': ['i'], 'c2': ['h', 'z']},
    {'c1': ['h', 'b'], 'c2': ['e']},
]
oh = preprocessing.OneHotEncoder(sparse=True)
for x in X:
    oh = oh.learn_one(x)
    pprint(oh.transform_one(x))
{'c1_a': 1, 'c1_u': 1, 'c2_d': 1}
{'c1_a': 1, 'c1_b': 1, 'c2_x': 1}
{'c1_i': 1, 'c2_h': 1, 'c2_z': 1}
{'c1_b': 1, 'c1_h': 1, 'c2_e': 1}
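Sets are handled in the same way as lists. Here is a minimal sketch; the output shown in the comment is what we would expect under that assumption, not taken from the library's documentation.
x = {'c1': {'u', 'a'}}  # a set instead of a list
oh = preprocessing.OneHotEncoder(sparse=True)
oh = oh.learn_one(x)
pprint(oh.transform_one(x))  # expected: {'c1_a': 1, 'c1_u': 1}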
Processing mini-batches is also possible.
from pprint import pprint
import random
import string

import pandas as pd

random.seed(42)
alphabet = list(string.ascii_lowercase)
X = pd.DataFrame(
    {
        'c1': random.choice(alphabet),
        'c2': random.choice(alphabet),
    }
    for _ in range(4)
)
X
c1 c2
0 u d
1 a x
2 i h
3 h e
oh = preprocessing.OneHotEncoder(sparse=True)
oh = oh.learn_many(X)
df = oh.transform_many(X)
df.loc[:, sorted(df.columns)]
c1_a c1_h c1_i c1_u c2_d c2_e c2_h c2_x
0 0 0 0 1 1 0 0 0
1 1 0 0 0 0 0 0 1
2 0 0 1 0 0 0 1 0
3 0 1 0 0 0 1 0 0
Keep in mind that support for sparse transformations is limited in the mini-batch case, which might affect the speed and memory footprint of your training loop.
Here's a non-sparse example:
oh = preprocessing.OneHotEncoder(sparse=False)
X_init = pd.DataFrame([{'c1': "Oranges", 'c2': "Apples"}])
oh = oh.learn_many(X_init)
oh = oh.learn_many(X)
df = oh.transform_many(X)
df.loc[:, sorted(df.columns)]
c1_Oranges c1_a c1_h c1_i c1_u c2_Apples c2_d c2_e c2_h c2_x
0 0 0 0 0 1 0 1 0 0 0
1 0 1 0 0 0 0 0 0 0 1
2 0 0 0 1 0 0 0 0 1 0
3 0 0 1 0 0 0 0 1 0 0
Methods¶
learn_many
Update with a mini-batch of features.
A lot of transformers don't actually have to do anything during the learn_many step because they are stateless. For this reason the default behavior of this function is to do nothing. Transformers that do need to update their state during the learn_many step can override this method.
Parameters
- X — 'pd.DataFrame'
Returns
Transformer: self
learn_one
Update with a set of features x.
A lot of transformers don't actually have to do anything during the learn_one step because they are stateless. For this reason the default behavior of this function is to do nothing. Transformers that do need to update their state during the learn_one step can override this method.
Parameters
- x — 'dict'
Returns
Transformer: self
transform_many
Transform a mini-batch of features.
Parameters
- X — 'pd.DataFrame'
Returns
pd.DataFrame: A new DataFrame.
transform_one
Transform a set of features x.
Parameters
- x — 'dict'
- y — defaults to None
Returns
dict: The transformed values.