Shuffles a stream of data.
This works by maintaining a buffer of elements. The first buffer_size elements are stored in memory. Once the buffer is full, a random element from the buffer is yielded. Each time an element is yielded, the next element in the stream takes its place and the buffer is sampled again. Increasing buffer_size improves the quality of the shuffling.
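The buffering scheme described above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not river's actual implementation: the name buffered_shuffle is hypothetical, and draining the remaining buffer in random order at the end of the stream is an assumption, so the output order for a given seed need not match stream.shuffle.

```python
import itertools
import random


def buffered_shuffle(iterable, buffer_size, seed=None):
    """Hypothetical sketch of buffer-based stream shuffling."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    it = iter(iterable)
    # Store the first buffer_size elements in memory.
    buffer = list(itertools.islice(it, buffer_size))
    for item in it:
        # Yield a random element and put the incoming element in its slot.
        i = rng.randrange(len(buffer))
        yield buffer[i]
        buffer[i] = item
    # Assumption: once the stream ends, emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer
```

Because an element can only be yielded once it has entered the buffer, an element at position k of the input can never appear before output position k - buffer_size + 1. This is why a small buffer only shuffles locally and a larger one gives a better mix at the cost of memory.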
If you really want to stream over your dataset in a "good" random order, the best way is to split your dataset into smaller datasets and loop over them in a round-robin fashion. You may do this by using the
roundrobin recipe from the
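The round-robin approach can be sketched as follows. This is a minimal stand-in written with collections.deque rather than a copy of any published recipe; the function name and the way the dataset is split into parts are assumptions made for illustration.

```python
from collections import deque


def roundrobin(*iterables):
    """Yield one element from each iterable in turn until all are exhausted."""
    queue = deque(iter(it) for it in iterables)
    while queue:
        it = queue.popleft()
        try:
            item = next(it)
        except StopIteration:
            # This iterable is exhausted, so drop it from the rotation.
            continue
        yield item
        queue.append(it)


# Hypothetical usage: a dataset split into two smaller parts.
parts = [range(0, 3), range(3, 6)]
print(list(roundrobin(*parts)))  # [0, 3, 1, 4, 2, 5]
```

Each part could itself be passed through a buffered shuffle before being interleaved, which mixes elements from distant regions of the dataset without loading it all into memory.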
The stream to shuffle.
The size of the buffer which contains the elements held in memory. Increasing this will increase randomness but will incur more memory usage.
seed (int) – defaults to
Random seed used for sampling.
>>> from river import stream
>>> for i in stream.shuffle(range(15), buffer_size=5, seed=42):
...     print(i)
0
5
2
1
8
9
6
4
11
12
10
7
14
13
3