OneHotEncoder
One-hot encoding.
This transformer encodes every feature it is given. If a feature's value is a list or a set, every entry in that list or set is encoded. You can apply it to a subset of features by composing it with compose.Select or compose.SelectType.
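Parameters
- sparse – defaults to False
Whether to omit 0s from the output. If False, every value seen so far is included, with an explicit 0 when it is absent from the current sample. If True, only the entries equal to 1 are returned.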
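To make the effect of sparse concrete, here is a minimal pure-Python sketch of a streaming one-hot encoder for scalar values. It is not River's implementation: the class name SketchOneHot and its internals are hypothetical, and only mimic the behaviour described for the parameter.

```python
from collections import defaultdict


class SketchOneHot:
    """Hypothetical stand-in for illustration; not River's actual code."""

    def __init__(self, sparse=False):
        self.sparse = sparse
        # Maps each feature name to the set of values seen so far.
        self.values = defaultdict(set)

    def learn_one(self, x):
        for name, value in x.items():
            self.values[name].add(value)
        return self

    def transform_one(self, x):
        out = {}
        if not self.sparse:
            # Make the 0s explicit for every value seen so far.
            for name, seen in self.values.items():
                for value in seen:
                    out[f"{name}_{value}"] = 0
        # Entries present in the current sample are set to 1.
        for name, value in x.items():
            out[f"{name}_{value}"] = 1
        return out


samples = [{"c1": "u", "c2": "d"}, {"c1": "a", "c2": "x"}]

dense = SketchOneHot(sparse=False)
dense_out = [dense.learn_one(x).transform_one(x) for x in samples]

sparse = SketchOneHot(sparse=True)
sparse_out = [sparse.learn_one(x).transform_one(x) for x in samples]
```

With sparse=False the second sample also carries c1_u and c2_d as 0, because those values were seen earlier. With sparse=True only the active entries are returned.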
Examples
Let us first create an example dataset.
>>> from pprint import pprint
>>> import random
>>> import string
>>> random.seed(42)
>>> alphabet = list(string.ascii_lowercase)
>>> X = [
...     {
...         'c1': random.choice(alphabet),
...         'c2': random.choice(alphabet),
...     }
...     for _ in range(4)
... ]
>>> pprint(X)
[{'c1': 'u', 'c2': 'd'},
 {'c1': 'a', 'c2': 'x'},
 {'c1': 'i', 'c2': 'h'},
 {'c1': 'h', 'c2': 'e'}]
We can now apply one-hot encoding. All the provided features are one-hot encoded, so there is no need to specify which features to encode.
>>> from river import preprocessing
>>> oh = preprocessing.OneHotEncoder(sparse=True)
>>> for x in X:
...     oh = oh.learn_one(x)
...     pprint(oh.transform_one(x))
{'c1_u': 1, 'c2_d': 1}
{'c1_a': 1, 'c2_x': 1}
{'c1_i': 1, 'c2_h': 1}
{'c1_h': 1, 'c2_e': 1}
Setting the sparse parameter to False includes, as explicit 0s, the previously seen values that are absent from the current sample.
>>> oh = preprocessing.OneHotEncoder(sparse=False)
>>> for x in X[:2]:
...     oh = oh.learn_one(x)
...     pprint(oh.transform_one(x))
{'c1_u': 1, 'c2_d': 1}
{'c1_a': 1, 'c1_u': 0, 'c2_d': 0, 'c2_x': 1}
A subset of the features can be one-hot encoded by using an instance of compose.Select.
>>> from river import compose
>>> pp = compose.Select('c1') | preprocessing.OneHotEncoder()
>>> for x in X:
...     pp = pp.learn_one(x)
...     pprint(pp.transform_one(x))
{'c1_u': 1}
{'c1_a': 1, 'c1_u': 0}
{'c1_a': 0, 'c1_i': 1, 'c1_u': 0}
{'c1_a': 0, 'c1_h': 1, 'c1_i': 0, 'c1_u': 0}
You can preserve the c2 feature by using a union:
>>> pp = compose.Select('c1') | preprocessing.OneHotEncoder()
>>> pp += compose.Select('c2')
>>> for x in X:
...     pp = pp.learn_one(x)
...     pprint(pp.transform_one(x))
{'c1_u': 1, 'c2': 'd'}
{'c1_a': 1, 'c1_u': 0, 'c2': 'x'}
{'c1_a': 0, 'c1_i': 1, 'c1_u': 0, 'c2': 'h'}
{'c1_a': 0, 'c1_h': 1, 'c1_i': 0, 'c1_u': 0, 'c2': 'e'}
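Judging by the output above, the union applies each branch to the same input and merges the resulting dictionaries. The following sketch reproduces that idea with plain functions. The helpers select, one_hot, pipeline and union are hypothetical names written for this illustration and are not part of River.

```python
def select(keys):
    """Hypothetical helper: keep only the listed keys of a feature dict."""
    def transform(x):
        return {k: v for k, v in x.items() if k in keys}
    return transform


def one_hot(x):
    """Hypothetical helper: sparse one-hot encoding of scalar features."""
    return {f"{k}_{v}": 1 for k, v in x.items()}


def pipeline(*steps):
    """Feed the output of each step into the next one."""
    def transform(x):
        for step in steps:
            x = step(x)
        return x
    return transform


def union(*branches):
    """Apply every branch to the same input and merge the outputs."""
    def transform(x):
        out = {}
        for branch in branches:
            out.update(branch(x))
        return out
    return transform


encode_c1 = pipeline(select({"c1"}), one_hot)  # one-hot encode c1 only
keep_c2 = select({"c2"})                       # pass c2 through untouched
pp_sketch = union(encode_c1, keep_c2)

result = pp_sketch({"c1": "u", "c2": "d"})
```

Here result holds the encoded c1 entry next to the raw c2 value, which matches the first line of the output above.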
Similar to the above examples, we can also pass values as a list. This will one-hot encode all of the entries individually.
>>> X = [{'c1': ['u', 'a'], 'c2': ['d']},
... {'c1': ['a', 'b'], 'c2': ['x']},
... {'c1': ['i'], 'c2': ['h', 'z']},
... {'c1': ['h', 'b'], 'c2': ['e']}]
>>> oh = preprocessing.OneHotEncoder(sparse=True)
>>> for x in X:
...     oh = oh.learn_one(x)
...     pprint(oh.transform_one(x))
{'c1_a': 1, 'c1_u': 1, 'c2_d': 1}
{'c1_a': 1, 'c1_b': 1, 'c2_x': 1}
{'c1_i': 1, 'c2_h': 1, 'c2_z': 1}
{'c1_b': 1, 'c1_h': 1, 'c2_e': 1}
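The handling of list and set values can be pictured as one indicator per entry. Below is a small sketch of the sparse case under that assumption. The function one_hot_entries is a hypothetical name, and accepting scalars and collections side by side in one sample is a choice of this sketch, not a documented guarantee.

```python
def one_hot_entries(x):
    """Sparse one-hot encoding where lists and sets expand per entry."""
    out = {}
    for name, value in x.items():
        # A list or set contributes one indicator per entry;
        # any other value contributes a single indicator.
        entries = value if isinstance(value, (list, set)) else [value]
        for entry in entries:
            out[f"{name}_{entry}"] = 1
    return out


encoded = one_hot_entries({"c1": ["u", "a"], "c2": ["d"]})
mixed = one_hot_entries({"c1": "i", "c2": ["h", "z"]})
```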
Processing mini-batches is also possible.
>>> from pprint import pprint
>>> import random
>>> import string
>>> import pandas as pd
>>> random.seed(42)
>>> alphabet = list(string.ascii_lowercase)
>>> X = pd.DataFrame(
...     {
...         'c1': random.choice(alphabet),
...         'c2': random.choice(alphabet),
...     }
...     for _ in range(4)
... )
>>> X
  c1 c2
0  u  d
1  a  x
2  i  h
3  h  e
>>> oh = preprocessing.OneHotEncoder(sparse=True)
>>> oh = oh.learn_many(X)
>>> df = oh.transform_many(X)
>>> df.loc[:, sorted(df.columns)]
   c1_a  c1_h  c1_i  c1_u  c2_d  c2_e  c2_h  c2_x
0     0     0     0     1     1     0     0     0
1     1     0     0     0     0     0     0     1
2     0     0     1     0     0     0     1     0
3     0     1     0     0     0     1     0     0
Keep in mind that support for sparse transformations is limited in the mini-batch case, which might affect the speed and memory footprint of your training loop.
Here's a non-sparse example:
>>> oh = preprocessing.OneHotEncoder(sparse=False)
>>> X_init = pd.DataFrame([{'c1': "Oranges", 'c2': "Apples"}])
>>> oh = oh.learn_many(X_init)
>>> oh = oh.learn_many(X)
>>> df = oh.transform_many(X)
>>> df.loc[:, sorted(df.columns)]
   c1_Oranges  c1_a  c1_h  c1_i  c1_u  c2_Apples  c2_d  c2_e  c2_h  c2_x
0           0     0     0     0     1          0     1     0     0     0
1           0     1     0     0     0          0     0     0     0     1
2           0     0     0     1     0          0     0     0     1     0
3           0     0     1     0     0          0     0     1     0     0
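The all-zero c1_Oranges and c2_Apples columns appear because those values were learned from an earlier batch. The sketch below illustrates this accumulation with lists of dicts standing in for DataFrames. SketchBatchOneHot is a hypothetical class written for this illustration and does not reflect how River implements learn_many or transform_many.

```python
class SketchBatchOneHot:
    """Hypothetical non-sparse batch encoder; rows are plain dicts."""

    def __init__(self):
        # Every "<feature>_<value>" column learned so far.
        self.columns = set()

    def learn_many(self, rows):
        for row in rows:
            for name, value in row.items():
                self.columns.add(f"{name}_{value}")
        return self

    def transform_many(self, rows):
        out = []
        for row in rows:
            active = {f"{name}_{value}" for name, value in row.items()}
            # One entry per learned column, 1 if active in this row.
            out.append({col: int(col in active) for col in sorted(self.columns)})
        return out


oh_sketch = SketchBatchOneHot()
oh_sketch.learn_many([{"c1": "Oranges", "c2": "Apples"}])

batch = [{"c1": "u", "c2": "d"}, {"c1": "a", "c2": "x"}]
oh_sketch.learn_many(batch)
encoded_rows = oh_sketch.transform_many(batch)
```

Each row of encoded_rows contains all six learned columns, with the two from the first batch set to 0.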
Methods
learn_many
Update with a mini-batch of features.
Many transformers are stateless and have nothing to do during the learn_many step, so the default behavior of this method is to do nothing. Transformers that do need to learn from a mini-batch can override it.
Parameters
- X ('pd.DataFrame')
Returns
Transformer: self
learn_one
Update with a set of features x.
Many transformers are stateless and have nothing to do during the learn_one step, so the default behavior of this method is to do nothing. Transformers that do need to learn from a sample can override it.
Parameters
- x (dict)
Returns
Transformer: self
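The default-versus-override pattern described for learn_one can be sketched as follows. SketchTransformer, Uppercase and SeenCounter are hypothetical classes invented for this example. They only illustrate a no-op default that returns self and a stateful subclass that overrides it.

```python
class SketchTransformer:
    """Hypothetical base class with a do-nothing learn_one."""

    def learn_one(self, x):
        # Stateless by default: nothing to update.
        return self

    def transform_one(self, x):
        raise NotImplementedError


class Uppercase(SketchTransformer):
    """Stateless: relies on the inherited no-op learn_one."""

    def transform_one(self, x):
        return {k: v.upper() for k, v in x.items()}


class SeenCounter(SketchTransformer):
    """Stateful: overrides learn_one to count how often each key occurs."""

    def __init__(self):
        self.counts = {}

    def learn_one(self, x):
        for key in x:
            self.counts[key] = self.counts.get(key, 0) + 1
        return self

    def transform_one(self, x):
        return {key: self.counts.get(key, 0) for key in x}


upper_out = Uppercase().learn_one({"a": "x"}).transform_one({"a": "x"})

counter = SeenCounter()
counter.learn_one({"a": 1}).learn_one({"a": 2})
count_out = counter.transform_one({"a": 0})
```

Returning self from learn_one is what allows the calls to be chained, as in the last lines.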
transform_many
Transform a mini-batch of features.
Parameters
- X ('pd.DataFrame')
Returns
pd.DataFrame: A new DataFrame.
transform_one
Transform a set of features x.
Parameters
- x (dict)
- y – defaults to None
Returns
dict: The transformed values.