LossyCount¶
Lossy Count with Forgetting factor1.
Keep track of the most frequent item(set)s in a data stream and apply a forgetting factor to discard previous frequent items that do not often appear anymore. This is an approximation algorithm designed to work with a limited amount of memory rather than accounting for every possible solution (thus using an unbounded memory footprint). Any hashable type can be passed as input, hence tuples or frozensets can also be monitored.
Considering a data stream where n elements were observed so far, the Lossy Count algorithm has the following properties:
-
All item(set)s whose true frequency exceeds
support * nare output. There are no false negatives -
No item(set) whose true frequency is less than
(support - epsilon) * nis outputted -
Estimated frequencies are less than the true frequencies by at most
epsilon * n
Parameters¶
-
support (float) – defaults to
0.001The support threshold used to determine if an item is frequent. The value of
supportmust be in \([0, 1]\). Elements whose frequency is higher thansupporttimes the number of observations seen so far are outputted. -
epsilon (float) – defaults to
0.005Error parameter to control the accuracy-memory tradeoff. The value of
epsilonmust be in \((0, 1]\) and typicallyepsilon\(\ll\)support. The smaller theepsilon, the more accurate the estimates will be, but the count sketch will have an increased memory footprint. -
alpha (float) – defaults to
0.999Forgetting factor applied to the frequency estimates to reduce the impact of old items. The value of
alphamust be in \((0, 1]\).
Attributes¶
- name
Examples¶
>>> from river import datasets
>>> from river import stats
>>> dataset = datasets.TREC07()
>>> lc = stats.LossyCount()
>>> for x, _ in dataset.take(10_000):
... lc.update(x["sender"])
>>> counts = lc.get()
>>> for sender in counts:
... print(f"{sender} {lc[sender]:.2f}")
Groupe Desjardins / AccesD <services.de.cartes@scd.desjardins.com> 2494.79
Groupe Desjardins / AccesD <securiteaccesd@desjardins.com> 77.82
"AccuWeather.com Alert" <inbox@messaging.accuweather.com> 67.15
<alert@broadcast.shareholder.com> 56.22
"Bank of America Inc." <Security@bankofamerica.com> 28.37
metze@samba.org 22.95
tridge@samba.org 15.98
Michael Adam <ma@sernet.de> 5.99
abartlet@samba.org 4.00
"Zachary Kline" <Z_kline@hotmail.com> 3.99
Jonathan Worthington <jonathan@jnthn.net> 2.99
charles loboz <charles_loboz@yahoo.com> 2.00
slashdot@slashdot.org 2.00
Methods¶
get
Return the current value of the statistic.
update
Update and return the called instance.
Parameters
- x (Hashable)
References¶
-
Veloso, B., Tabassum, S., Martins, C., Espanha, R., Azevedo, R., & Gama, J. (2020). Interconnect bypass fraud detection: a case study. Annals of Telecommunications, 75(9), 583-596. ↩