
shuffle

Shuffles a stream of data.

This works by maintaining a buffer of elements. The first buffer_size elements are stored in memory. Once the buffer is full, a random element inside the buffer is yielded. Every time an element is yielded, the next element in the stream replaces it and the buffer is sampled again. A larger buffer_size brings the output closer to a uniform shuffle, at the cost of more memory.
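
The buffering scheme described above can be sketched in plain Python. This is an illustrative approximation, not river's actual source: the name `buffered_shuffle` is hypothetical, and shuffling the leftover buffer once the stream ends is an assumption about how the remaining elements are flushed.

```python
import random


def buffered_shuffle(stream, buffer_size, seed=None):
    """Hypothetical sketch of buffer-based shuffling (not river's implementation)."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")

    rng = random.Random(seed)
    buffer = []

    for item in stream:
        # Fill the buffer with the first buffer_size elements.
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Buffer is full: yield a random slot, then refill it with the new item.
        i = rng.randrange(len(buffer))
        yield buffer[i]
        buffer[i] = item

    # Assumption: once the stream is exhausted, flush the rest in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(15), buffer_size=5, seed=42))
```

In this sketch the first element yielded always comes from the first buffer_size elements of the stream, which is why a small buffer only shuffles locally.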

If you really want to stream over your dataset in a "good" random order, the best way is to split your dataset into smaller datasets and loop over them in a round-robin fashion. You can do this with the roundrobin recipe from the itertools documentation. Note that roundrobin is a recipe to copy into your code, not a function exported by the itertools module.
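
As a rough illustration of the round-robin approach, the snippet below adapts the roundrobin recipe from the itertools documentation and applies it to three in-memory chunks. The `chunks` list is a stand-in assumed for this example; in practice each chunk would be an iterator over one of the smaller datasets.

```python
from itertools import cycle, islice


def roundrobin(*iterables):
    """Visit the iterables in turn: roundrobin('ABC', 'D', 'EF') -> A D E B F C."""
    num_active = len(iterables)
    nexts = cycle(iter(it).__next__ for it in iterables)
    while num_active:
        try:
            for get_next in nexts:
                yield get_next()
        except StopIteration:
            # Drop the exhausted iterator and keep cycling over the others.
            num_active -= 1
            nexts = cycle(islice(nexts, num_active))


# Hypothetical split of a 15-element dataset into three smaller chunks.
chunks = [range(0, 5), range(5, 10), range(10, 15)]
interleaved = list(roundrobin(*chunks))
```

Here `interleaved` is `[0, 5, 10, 1, 6, 11, 2, 7, 12, 3, 8, 13, 4, 9, 14]`: consecutive elements come from different chunks, so elements that were neighbours in the original dataset end up spread apart.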

Parameters

  • stream (Iterator)

    The stream to shuffle.

  • buffer_size (int)

    The size of the buffer which contains the elements held in memory. Increasing this will increase randomness but will incur more memory usage.

  • seed (int) – defaults to None

    Random seed used for sampling.

Examples

>>> from river import stream

>>> for i in stream.shuffle(range(15), buffer_size=5, seed=42):
...     print(i)
0
5
2
1
8
9
6
4
11
12
10
7
14
13
3

References