Set¶

Approximate tracking of observed items using Bloom filters.

Bloom filters enable using a limited amount of memory to check whether a given item was already observed in a stream. They can be used similarly to Python's built-in sets with the difference that items are not explicitly stored. For that reason, element removal and set difference are not currently supported.

Bloom filters store a bit array and map incoming items to k index positions in the such array. The selected positions are set to True. Therefore, a binary code representation is created for each item. Membership works by projecting the query item and checking if every position of its binary code is True. If that is not the case, the item was not observed yet. A nice property of Bloom filters is that they do not yield false negatives: unobserved items might be signalized as observed, but observed items are never signalized as unobserved.

If more than one item has the same binary code, i.e., hash collisions happen, the accuracy of the Bloom filter decreases, and false positives are produced. For instance, a previously unobserved item is signalized as observed. Increasing the size of the binary array and the value of k increase the filter's accuracy as hash collisions are avoided. Nonetheless, even using an increased number of hash functions, hash collisions will frequently happen if the array capacity is too small. The length of the bit array and the number of hash functions are inferred automatically from the supplied capacity and fp_rate.

Parameters¶

capacity

Type → int

Default → 2048

The maximum capacity of the Bloom filter, i.e., the maximum number of distinct items to store given the selected fp_rate.
fp_rate

Type → float

Default → 0.01

The allowed rate of false positives. The probability of obtaining a true positive is 1 - fp_rate.
seed

Type → int | None

Default → None

Random seed for reproducibility.

Attributes¶

n_bits

Return the size of the binary array used by the Bloom filter.
n_hash

Return the number of used hash functions.

Examples¶

import random
from river import sketch

rng = random.Random(42)
s_set = sketch.Set(capacity=100, seed=0)

We can retrieve the number of selected hash functions:

s_set.n_hash

And the size of the binary array used by the Bloom filter:

s_set.n_bits

We can add new items and check for membership using the same calls used by Python's standard sets:

for _ in range(1000):
    s_set.add(rng.randint(0, 200))

1 in s_set

True

False positives might happen if the capacity is not large enough:

-10 in s_set

True

Iterables can also be supplied to perform multiple updates with a single call to update:

s_set = s_set.update([1, 2, 3, 4, 5, 6, 7])

We can also combine instances of sketch.Set using the intersection and union operations, as long as they share the same hash functions and capability. In other words, all they hyperparameters match. Let's create two instances that will monitor different portions of a stream of random numbers:

s1 = sketch.Set(seed=8)
s2 = sketch.Set(seed=8)

for _ in range(1000):
    s1.add(rng.randint(0, 5000))

for _ in range(1000):
    s2.add(rng.randint(0, 5000))

43 in s1

True

43 in s2

False

We can get the intersection between the two instances by using:

s_intersection = s1 & s2
43 in s_intersection

False

We can also obtain the set union:

s_union = s1 | s2

43 in s_union

True

The same effect of the non-inplace dunder methods can be achieved via explicit method calls:

43 in s1.intersection(s2)

False

43 in s1.union(s2)

True

Methods¶

add

intersection

Set intersection.

Return a new instance that results from the set intersection between the current Set object and other. Dunder operators can be used to replace the method call, i.e., a &= b and a & b for inplace and non-inplace intersections, respectively.

Parameters

other — 'Set'

union

Set union.

Return a new instance that results from the set union between the current Set object and other. Dunder operators can be used to replace the method call, i.e., a |= b and a | b for inplace and non-inplace unions, respectively.

Parameters

other — 'Set'

update

Notes¶

This implementation uses an integer to represent the binary array. Bitwise operations are performed in the integer to reflect the Bloom filter updates.