OneHotEncoder

One-hot encoding.

This transformer encodes every feature it is given. If a feature's value is a list or a set, every entry of that list or set is encoded. To apply it to a subset of features, compose it with compose.Select or compose.SelectType.

Parameters

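  • sparse – defaults to False

    Whether to omit zero entries from the output. If False, every category seen so far appears in the output, with a 0 for those that are absent from the current sample.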
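
To make the effect of sparse concrete, here is a minimal pure-Python sketch of a streaming one-hot encoder. It is an illustration only and not River's implementation: the class name TinyOneHot and its internals are hypothetical, and it handles scalar values only.

```python
from collections import defaultdict


class TinyOneHot:
    """Hypothetical stand-in used only to illustrate the `sparse` flag."""

    def __init__(self, sparse=False):
        self.sparse = sparse
        # Maps each feature name to the set of values seen so far.
        self.seen = defaultdict(set)

    def learn_one(self, x):
        for name, value in x.items():
            self.seen[name].add(value)
        return self

    def transform_one(self, x):
        out = {}
        if not self.sparse:
            # Dense mode: emit an explicit 0 for every value seen so far.
            for name, values in self.seen.items():
                for value in values:
                    out[f"{name}_{value}"] = 0
        # The values present in this sample are always set to 1.
        for name, value in x.items():
            out[f"{name}_{value}"] = 1
        return out


dense = TinyOneHot(sparse=False)
compact = TinyOneHot(sparse=True)
for x in [{"c1": "u", "c2": "d"}, {"c1": "a", "c2": "x"}]:
    dense_out = dense.learn_one(x).transform_one(x)
    compact_out = compact.learn_one(x).transform_one(x)
```

After the second sample, dense_out contains zeros for 'c1_u' and 'c2_d', whereas compact_out contains only the two active keys.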

Examples

Let us first create an example dataset.

>>> from pprint import pprint
>>> import random
>>> import string

>>> random.seed(42)
>>> alphabet = list(string.ascii_lowercase)
>>> X = [
...     {
...         'c1': random.choice(alphabet),
...         'c2': random.choice(alphabet),
...     }
...     for _ in range(4)
... ]
>>> pprint(X)
[{'c1': 'u', 'c2': 'd'},
 {'c1': 'a', 'c2': 'x'},
 {'c1': 'i', 'c2': 'h'},
 {'c1': 'h', 'c2': 'e'}]

We can now apply one-hot encoding. All the provided features are one-hot encoded, so there is no need to specify which features to encode.

>>> from river import preprocessing

>>> oh = preprocessing.OneHotEncoder(sparse=True)
>>> for x in X:
...     oh = oh.learn_one(x)
...     pprint(oh.transform_one(x))
{'c1_u': 1, 'c2_d': 1}
{'c1_a': 1, 'c2_x': 1}
{'c1_i': 1, 'c2_h': 1}
{'c1_h': 1, 'c2_e': 1}

Setting the sparse parameter to False makes the output also include, with a value of 0, the categories seen so far that are absent from the current sample.

>>> oh = preprocessing.OneHotEncoder(sparse=False)
>>> for x in X[:2]:
...     oh = oh.learn_one(x)
...     pprint(oh.transform_one(x))
{'c1_u': 1, 'c2_d': 1}
{'c1_a': 1, 'c1_u': 0, 'c2_d': 0, 'c2_x': 1}

A subset of the features can be one-hot encoded by using an instance of compose.Select.

>>> from river import compose

>>> pp = compose.Select('c1') | preprocessing.OneHotEncoder()

>>> for x in X:
...     pp = pp.learn_one(x)
...     pprint(pp.transform_one(x))
{'c1_u': 1}
{'c1_a': 1, 'c1_u': 0}
{'c1_a': 0, 'c1_i': 1, 'c1_u': 0}
{'c1_a': 0, 'c1_h': 1, 'c1_i': 0, 'c1_u': 0}

You can preserve the c2 feature by using a union:

>>> pp = compose.Select('c1') | preprocessing.OneHotEncoder()
>>> pp += compose.Select('c2')

>>> for x in X:
...     pp = pp.learn_one(x)
...     pprint(pp.transform_one(x))
{'c1_u': 1, 'c2': 'd'}
{'c1_a': 1, 'c1_u': 0, 'c2': 'x'}
{'c1_a': 0, 'c1_i': 1, 'c1_u': 0, 'c2': 'h'}
{'c1_a': 0, 'c1_h': 1, 'c1_i': 0, 'c1_u': 0, 'c2': 'e'}
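
The data flow of this pipeline can be sketched with plain functions. This is a simplified illustration under stated assumptions, not how River implements pipelines or unions: select, encode, and union are hypothetical helpers, and encode produces sparse output only.

```python
def select(x, *keys):
    """Keep only the requested features (hypothetical stand-in for a selector)."""
    return {k: x[k] for k in keys if k in x}


def encode(x):
    """Sparse one-hot encoding of every feature it receives."""
    return {f"{k}_{v}": 1 for k, v in x.items()}


def union(*parts):
    """Merge the outputs of several branches into one feature dict."""
    merged = {}
    for part in parts:
        merged.update(part)
    return merged


x = {"c1": "u", "c2": "d"}
# One branch selects and encodes c1, the other passes c2 through untouched.
result = union(encode(select(x, "c1")), select(x, "c2"))
```

Each branch receives the same input x, and the union concatenates their outputs, which is why c2 keeps its raw value.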

Similar to the above examples, we can also pass values as a list. This will one-hot encode all of the entries individually.

>>> X = [{'c1': ['u', 'a'], 'c2': ['d']},
...     {'c1': ['a', 'b'], 'c2': ['x']},
...     {'c1': ['i'], 'c2': ['h', 'z']},
...     {'c1': ['h', 'b'], 'c2': ['e']}]

>>> oh = preprocessing.OneHotEncoder(sparse=True)
>>> for x in X:
...     oh = oh.learn_one(x)
...     pprint(oh.transform_one(x))
{'c1_a': 1, 'c1_u': 1, 'c2_d': 1}
{'c1_a': 1, 'c1_b': 1, 'c2_x': 1}
{'c1_i': 1, 'c2_h': 1, 'c2_z': 1}
{'c1_b': 1, 'c1_h': 1, 'c2_e': 1}
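
As a rough sketch of the expansion shown above, a list or set value can be treated as several entries of the same feature, each getting its own key. The helper one_hot_keys is hypothetical and only mirrors the sparse output format; it is not taken from River's source.

```python
def one_hot_keys(x):
    """Hypothetical helper: expand scalar, list, or set values into one-hot keys."""
    out = {}
    for name, value in x.items():
        # Treat a scalar as a collection with a single entry.
        entries = value if isinstance(value, (list, set)) else [value]
        for entry in entries:
            out[f"{name}_{entry}"] = 1
    return out


encoded = one_hot_keys({"c1": ["u", "a"], "c2": ["d"]})
mixed = one_hot_keys({"c1": "i", "c2": {"h", "z"}})
```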

Processing mini-batches is also possible.

>>> from pprint import pprint
>>> import random
>>> import string
>>> import pandas as pd

>>> random.seed(42)
>>> alphabet = list(string.ascii_lowercase)
>>> X = pd.DataFrame(
...     {
...         'c1': random.choice(alphabet),
...         'c2': random.choice(alphabet),
...     }
...     for _ in range(4)
... )
>>> X
  c1 c2
0  u  d
1  a  x
2  i  h
3  h  e

>>> oh = preprocessing.OneHotEncoder(sparse=True)
>>> oh = oh.learn_many(X)

>>> df = oh.transform_many(X)
>>> df.loc[:, sorted(df.columns)]
    c1_a  c1_h  c1_i  c1_u  c2_d  c2_e  c2_h  c2_x
0     0     0     0     1     1     0     0     0
1     1     0     0     0     0     0     0     1
2     0     0     1     0     0     0     1     0
3     0     1     0     0     0     1     0     0

Keep in mind that support for sparse output is limited in the mini-batch case, which may affect the speed and memory footprint of your training loop.

Here's a non-sparse example:

>>> oh = preprocessing.OneHotEncoder(sparse=False)
>>> X_init = pd.DataFrame([{'c1': "Oranges", 'c2': "Apples"}])
>>> oh = oh.learn_many(X_init)
>>> oh = oh.learn_many(X)

>>> df = oh.transform_many(X)
>>> df.loc[:, sorted(df.columns)]
    c1_Oranges  c1_a  c1_h  c1_i  c1_u  c2_Apples  c2_d  c2_e  c2_h  c2_x
0           0     0     0     0     1          0     1     0     0     0
1           0     1     0     0     0          0     0     0     0     1
2           0     0     0     1     0          0     0     0     1     0
3           0     0     1     0     0          0     0     1     0     0
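
One plausible way to see why sparsity matters less in the mini-batch case is that a rectangular table holds a cell for every row and column pair, zeros included. The sketch below uses plain lists instead of pandas, and transform_batch is a hypothetical helper, not part of River.

```python
def transform_batch(rows):
    """Hypothetical helper: one-hot encode a list of dicts into dense rows."""
    # Collect every column name that appears anywhere in the batch.
    columns = sorted({f"{k}_{v}" for row in rows for k, v in row.items()})
    table = []
    for row in rows:
        active = {f"{k}_{v}" for k, v in row.items()}
        table.append([1 if col in active else 0 for col in columns])
    return columns, table


rows = [
    {"c1": "u", "c2": "d"},
    {"c1": "a", "c2": "x"},
    {"c1": "i", "c2": "h"},
    {"c1": "h", "c2": "e"},
]
columns, table = transform_batch(rows)

# Cells stored in the dense table versus entries that are actually 1.
n_cells = len(table) * len(columns)
n_ones = sum(sum(r) for r in table)
```

Here the table stores 32 cells to represent 8 non-zero entries, and the gap widens as the number of distinct categories grows.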

Methods

learn_many

Update with a mini-batch of features.

Many transformers are stateless and therefore have nothing to do during the learn_many step. For this reason, the default behavior of this method is to do nothing. Transformers that do need to update their state during learn_many can override it.

Parameters

  • X (pd.DataFrame)

Returns

Transformer: self

learn_one

Update with a set of features x.

Many transformers are stateless and therefore have nothing to do during the learn_one step. For this reason, the default behavior of this method is to do nothing. Transformers that do need to update their state during learn_one can override it.

Parameters

  • x (dict)

Returns

Transformer: self
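
The "do nothing by default, override when stateful" pattern described above can be sketched as follows. All class names here (BaseTransformer, Lowercase, CategoryCounter) are hypothetical and are not River classes; the sketch only assumes, as documented above, that learn_one returns self.

```python
class BaseTransformer:
    """Hypothetical base class; stateless by default."""

    def learn_one(self, x):
        # Default: nothing to update, just return self so calls can be chained.
        return self

    def transform_one(self, x):
        raise NotImplementedError


class Lowercase(BaseTransformer):
    """Stateless: relies on the inherited no-op learn_one."""

    def transform_one(self, x):
        return {k: v.lower() for k, v in x.items()}


class CategoryCounter(BaseTransformer):
    """Stateful: overrides learn_one to count how often each value appears."""

    def __init__(self):
        self.counts = {}

    def learn_one(self, x):
        for k, v in x.items():
            self.counts[(k, v)] = self.counts.get((k, v), 0) + 1
        return self

    def transform_one(self, x):
        return {k: self.counts.get((k, v), 0) for k, v in x.items()}


lower = Lowercase().learn_one({"c1": "U"})
counter = CategoryCounter().learn_one({"c1": "u"}).learn_one({"c1": "u"})
lowered = lower.transform_one({"c1": "U"})
counted = counter.transform_one({"c1": "u"})
```

A one-hot encoder falls in the second group, since it has to remember which categories it has seen.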

transform_many

Transform a mini-batch of features.

Parameters

  • X (pd.DataFrame)

Returns

pd.DataFrame: A new DataFrame.

transform_one

Transform a set of features x.

Parameters

  • x (dict)
  • y – defaults to None

Returns

dict: The transformed values.