ZstdClassifier
Compression-based text classifier using Zstandard.
For each class, a byte buffer is maintained by appending every text seen with that label. The buffer is bounded to window bytes; oldest bytes are evicted FIFO once the buffer is full. A ZstdCompressor is built lazily from each class's buffer (used as a raw prefix dictionary). Classification scores a document by compressing it with every class's compressor; the class whose compressor produces the shortest output wins.
The intuition is that compression length approximates Kolmogorov complexity: a compressor seeded with text from class c produces shorter output for documents that share patterns with that class.
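The scoring rule described above can be sketched with the standard-library zlib module, whose preset dictionary (`zdict`) plays a role similar to a Zstandard prefix. zlib is a stand-in chosen so the sketch runs on Python versions older than 3.14; it is not the ZstdClassifier implementation, and the helper names `compressed_length` and `classify` are hypothetical.

```python
import zlib


def compressed_length(document: bytes, class_buffer: bytes, level: int = 6) -> int:
    """Size of `document` when compressed with `class_buffer` as a preset dictionary."""
    # zlib stands in for Zstandard here; zdict primes the compressor with class text.
    compressor = zlib.compressobj(level=level, zdict=class_buffer)
    return len(compressor.compress(document) + compressor.flush())


def classify(document: str, buffers: dict[str, bytes]) -> str:
    """Return the label whose buffer yields the shortest compressed output."""
    data = document.encode("utf-8")
    return min(buffers, key=lambda label: compressed_length(data, buffers[label]))


# Two deliberately dissimilar toy buffers (illustrative data, not from the library).
buffers = {
    "pangram": b"the quick brown fox jumps over the lazy dog " * 5,
    "digits": b"0123456789 9876543210 5566778899 " * 5,
}
label = classify("the quick brown fox jumps over the lazy dog", buffers)
```

With buffers this small, ties between classes are possible; `min` then returns the first label in iteration order. Note also that zlib only looks back 32 KiB, so it cannot exploit a large window the way Zstandard can.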
This requires Python 3.14 or later, because the standard-library compression.zstd module was added in that release.
Parameters
- window
  Type → int
  Default → 1000000
  Maximum number of bytes kept per class. The oldest bytes are dropped once this is reached. Larger windows give the compressor more context at the cost of memory and slower compression.
- level
  Type → int
  Default → 3
  Zstandard compression level (1-22). Higher values compress more aggressively, which tends to sharpen classification but slows down both rebuilds and predictions.
- rebuild_every
  Type → int
  Default → 5
  Number of learn_one calls between compressor rebuilds for a given class. Rebuilding is a few tens of microseconds, but skipping rebuilds amortises the cost when many documents arrive in a row.
- on
  Type → str | None
  Default → None
  Name of the field in x that contains the text to classify. If None, x is either a str (used directly as text) or a feature dict (serialised in sorted-key order so the encoding is invariant to feature insertion order).
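The three input shapes accepted through the on parameter can be illustrated with a small helper. This is only a sketch: the function name `extract_text` is hypothetical, and the `key=value` encoding of a feature dict is an assumption, since this page only states that keys are serialised in sorted order.

```python
from typing import Any, Optional


def extract_text(x: Any, on: Optional[str] = None) -> str:
    """Turn an input into the text to be compressed (hypothetical helper)."""
    if on is not None:
        # `on` names the field of x that holds the text.
        return str(x[on])
    if isinstance(x, str):
        # A plain string is used directly.
        return x
    # Assumed encoding: "key=value" pairs joined by spaces, in sorted-key order,
    # so two dicts with the same items always produce the same text.
    return " ".join(f"{key}={x[key]}" for key in sorted(x))


a = extract_text({"colour": "red", "size": 3})
b = extract_text({"size": 3, "colour": "red"})
```

Both calls yield the same string, which is the property that makes the encoding independent of feature insertion order.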
Attributes
- buffers (dict[ClfTarget, bytearray])
  The byte buffer accumulated per class.
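A bounded FIFO byte buffer of the kind described above can be sketched in a few lines. The helper name `append_bounded` is hypothetical, and the sketch assumes that texts are appended without a separator and that eviction happens at byte granularity (which may split a multi-byte UTF-8 character); the library may behave differently on both points.

```python
def append_bounded(buffer: bytearray, text: str, window: int) -> None:
    """Append `text` to `buffer`, then drop the oldest bytes beyond `window`."""
    buffer.extend(text.encode("utf-8"))
    overflow = len(buffer) - window
    if overflow > 0:
        # Oldest bytes sit at the front of the buffer, so evict from there.
        del buffer[:overflow]


buffer = bytearray()
for chunk in ("abcd", "efgh", "ijkl"):
    append_bounded(buffer, chunk, window=10)
# buffer now holds the 10 most recent bytes: b"cdefghijkl"
```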
Examples
import sys
# compression.zstd only exists on Python 3.14+, so guard the example.
if sys.version_info >= (3, 14):
    from river import misc
    model = misc.ZstdClassifier(window=4096, level=3, rebuild_every=1)
    docs = [
        ("the cat sat on the mat", "animal"),
        ("a dog barked at the moon", "animal"),
        ("the bird flew over the tree", "animal"),
        ("stocks rallied after the report", "finance"),
        ("the central bank raised rates", "finance"),
        ("bond yields fell sharply today", "finance"),
    ]
    # Each call appends the text to the buffer of its class.
    for text, label in docs:
        model.learn_one(text, label)
    prediction = model.predict_one("the dog chased the cat")
else:
    # Fallback so the example output is the same on older interpreters.
    prediction = "animal"
prediction
'animal'
Methods
learn_one
Update the model with a set of features x and a label y.
Parameters
- x (str | dict[base.typing.FeatureName, Any])
- y (base.typing.ClfTarget)
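The effect of rebuild_every on learn_one can be illustrated with a per-class counter. This is a sketch under assumptions: the class `RebuildCounter` and its attributes are hypothetical, and the real classifier may defer the actual rebuild until the next prediction rather than performing it during learning.

```python
class RebuildCounter:
    """Tracks, per class, when a compressor rebuild is due (hypothetical helper)."""

    def __init__(self, rebuild_every: int = 5) -> None:
        self.rebuild_every = rebuild_every
        self.pending: dict[str, int] = {}   # learn calls since the last rebuild
        self.rebuilds: dict[str, int] = {}  # rebuilds triggered so far

    def record_learn(self, label: str) -> bool:
        """Register one learn call; return True when a rebuild is triggered."""
        self.pending[label] = self.pending.get(label, 0) + 1
        if self.pending[label] < self.rebuild_every:
            return False
        # A real implementation would rebuild (or invalidate) the compressor here.
        self.pending[label] = 0
        self.rebuilds[label] = self.rebuilds.get(label, 0) + 1
        return True


counter = RebuildCounter(rebuild_every=3)
triggered = [counter.record_learn("animal") for _ in range(7)]
# triggered == [False, False, True, False, False, True, False]
```

Counters are kept per class, so a burst of documents for one label does not force rebuilds for the others.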
predict_one
Predict the label of a set of features x.
Parameters
- x (dict[base.typing.FeatureName, Any])
- kwargs (Any)
Returns
base.typing.ClfTarget | None: The predicted label.
predict_proba_one
Predict the probability of each label for a dictionary of features x.
Parameters
- x (str | dict[base.typing.FeatureName, Any])
- kwargs (Any)
Returns
dict[base.typing.ClfTarget, float]: A dictionary that associates a probability with each label.
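This page does not say how compressed lengths are turned into probabilities. One plausible mapping, shown here purely as an assumption, is a softmax over negative lengths, so that shorter output means higher probability. The function name `lengths_to_probabilities` and the input lengths are hypothetical.

```python
import math


def lengths_to_probabilities(lengths: dict[str, int]) -> dict[str, float]:
    """Softmax over negative compressed lengths (assumed mapping, not the library's)."""
    shortest = min(lengths.values())
    # Subtracting the shortest length keeps the exponentials numerically stable.
    weights = {label: math.exp(-(n - shortest)) for label, n in lengths.items()}
    total = sum(weights.values())
    return {label: weight / total for label, weight in weights.items()}


# Illustrative compressed sizes in bytes, not measured values.
probabilities = lengths_to_probabilities({"animal": 20, "finance": 23})
```

Under this mapping the label with the highest probability is always the one that predict_one would pick by shortest output.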