Counter¶
Counting using the Count-Min Sketch (CMS) algorithm.
Contrary to an exhaustive approach, e.g., using a collections.Counter
, CMS uses a limited and fixed amount of memory. The CMS algorithm uses a sketch structure consisting of a matrix \(w \times d\).
These dimensions are obtained via:
-
\(w = \lceil \frac{e}{\epsilon} \rceil\), where \(e\) is the Euler number.
-
\(d = \lceil \ln\left(\frac{1}{\delta} \right) \rceil\).
Decreasing the values of \(\epsilon\) (epsilon
) and \(\delta\) (delta
) increase the accuracy of the algorithm, at the cost of increased memory usage. The values of w
and d
control the hash tables' capability and the amount of hash collisions, respectively.
CMS works by keeping d
hash tables with w
slots each. Elements are mapped to a slot in each hash table. These tables store the counting estimates. This implementation assumes the turnstile case described in the paper, i.e., count values and updates can be negative.
The count values obtained by CMS are always overestimates. Suppose \(c_i\) and \(\hat{c}_i\) are the ground truth and estimated count values, respectively, for a given element \(i\). CMS guarantees that \(c_i \le \hat{c}_i\) and, with probability \(1 - \delta\), \(\hat{c}_i \le c_i + \epsilon||\mathbf{c}||_1\). In the expression, \(||\mathbf{c}||_1 = \sum_i |c_i|\).
Parameters¶
-
epsilon (float) – defaults to
0.1
The approximation error parameter. The error in answering a query is within a factor of
epsilon
with probabilitydelta
. -
delta (float) – defaults to
0.05
A query estimates have a probability of
1 - delta
of having errors which are a factor ofepsilon
. See the CMS description above for more details. -
seed (int) – defaults to
None
Random seed for reproducibility.
Attributes¶
-
n_slots
The number of slots in each hash table.
-
n_tables
The number of stored hash tables.
Examples¶
>>> import collections
>>> from river import sketch
>>> cms = sketch.Counter(epsilon=0.005, seed=0)
>>> # To generate random numbers
>>> rng = random.Random(7)
>>> counter = collections.Counter()
We can check the number of slots per hash table:
>>> cms.n_slots
544
And the number of hash tables:
>>> cms.n_tables
3
Let's compare the sketch against a brute force approach:
>>> vals = []
>>> for _ in range(10000):
... v = rng.randint(-1000, 1000)
... cms = cms.update(v)
... counter[v] += 1
... vals.append(v)
Now, we can compare the estimates of CMS against the exhaustive counting strategy:
>>> counter[7]
5
>>> cms[7]
12
>>> counter[532]
4
>>> cms[532]
15
Keep in mind that CMS is an approximate sketch algorithm. Couting estimates for unseen values might not be always reliable:
>>> cms[1001]
9
We can check the number of elements stored by each approach:
>>> len(counter), len(cms)
(1982, 1632)
And also retrieve the total sum of counts:
>>> cms.total()
10000
We can decrease the error by allocating more memory in the CMS:
>>> cms_a = sketch.Counter(epsilon=0.001, delta=0.01, seed=0)
>>> for v in vals:
... cms_a = cms_a.update(v)
>>> cms_a[7]
5
>>> cms_a[532]
4
We can also obtain estimates of the dot product between two instances of river.collections.Counter
. This could be useful,
for instance, to estimate the cosine distance between the data monitored in two different counter sketch instances. Suppose we
create another CMS instance (the number of slots and hash tables must match) that monitors another sample of the same data
generating process:
>>> cms_b = sketch.Counter(epsilon=0.001, delta=0.01, seed=7)
>>> for _ in range(10000):
... v = rng.randint(-1000, 1000)
... cms_b = cms_b.update(v)
Now, we can define a cosine distance function:
>>> def cosine_dist(cms_a, cms_b):
... num = cms_a @ cms_b
... den = math.sqrt(cms_a @ cms_a) * math.sqrt(cms_b @ cms_b)
... return num / den
And use it to calculate the cosine distance between the elements monitored in cms_a
and cms_b
:
>>> cosine_dist(cms_a, cms_b)
0.175363...
Methods¶
total
Return the total count.