ZstdClassifier

Compression-based text classifier using Zstandard.

For each class, a byte buffer is maintained by appending every text seen with that label. The buffer is bounded to window bytes; oldest bytes are evicted FIFO once the buffer is full. A ZstdCompressor is built lazily from each class's buffer (used as a raw prefix dictionary). Classification scores a document by compressing it with every class's compressor; the class whose compressor produces the shortest output wins.
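The bounded per-class buffer described above can be sketched in a few lines. This is only an illustration: the helper name append_bounded is hypothetical, and the sketch assumes texts are UTF-8 encoded and appended without any separator, which the classifier itself may handle differently.

```python
def append_bounded(buffer: bytearray, text: str, window: int) -> None:
    """Append the UTF-8 bytes of `text`, then drop the oldest bytes beyond `window`."""
    buffer.extend(text.encode("utf-8"))
    excess = len(buffer) - window
    if excess > 0:
        # FIFO eviction: the oldest bytes sit at the front of the buffer.
        del buffer[:excess]


buf = bytearray()
for doc in ["aaaa", "bbbb", "cccc"]:
    append_bounded(buf, doc, window=10)

print(bytes(buf), len(buf))
```

After the third document the buffer would exceed the 10-byte window, so the two oldest bytes are discarded and the most recent text is kept in full.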

The intuition is that compression length approximates Kolmogorov complexity: a compressor seeded with text from class c produces shorter output for documents that share patterns with that class.
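To make the intuition concrete on any Python version, the following sketch swaps in zlib's preset dictionary (the zdict argument of zlib.compressobj) as a stand-in for the Zstandard prefix dictionary. It is not the classifier's implementation; compressed_size and classify are hypothetical helper names, and the two buffers are toy data.

```python
import zlib


def compressed_size(doc: bytes, prefix: bytes) -> int:
    """Size of `doc` once compressed with `prefix` as a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=prefix)
    return len(comp.compress(doc) + comp.flush())


def classify(doc: str, buffers: dict[str, bytes]) -> str:
    """Return the label whose buffer yields the shortest compressed output."""
    data = doc.encode("utf-8")
    return min(buffers, key=lambda label: compressed_size(data, buffers[label]))


# Toy per-class buffers (assumed data, for illustration only).
buffers = {
    "animal": b"the cat sat on the mat a dog barked at the moon",
    "finance": b"stocks rallied after the report the central bank raised rates",
}

label = classify("a dog barked at the cat on the mat", buffers)
print(label)
```

A document that shares long substrings with one buffer is encoded mostly as back-references into that buffer, so its compressed size under that class is smaller than under a class with unrelated text.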

This requires Python 3.14 or later, because the compression.zstd standard-library module was introduced in that release.

Parameters

  • window

    Type: int

    Default: 1000000

    Maximum number of bytes kept per class. The oldest bytes are dropped once this is reached. Larger windows give the compressor more context at the cost of memory and slower compression.

  • level

    Type: int

    Default: 3

    Zstandard compression level (1-22). Higher values compress more aggressively, which tends to sharpen classification but slows down both rebuilds and predictions.

  • rebuild_every

    Type: int

    Default: 5

    Number of learn_one calls between compressor rebuilds for a given class. A rebuild takes a few tens of microseconds, but skipping rebuilds amortises that cost when many documents arrive in a row.

  • on

    Type: str | None

    Default: None

    Name of the field in x that contains the text to classify. If None, x is either a str (used directly as text) or a feature dict (serialised in sorted-key order so the encoding is invariant to feature insertion order).
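The handling of the on parameter can be illustrated with a small helper. This is a sketch under assumptions: the function name extract_text and the key=value serialisation format are hypothetical, chosen only to show why sorting keys makes the encoding independent of insertion order. The classifier's actual byte format may differ.

```python
from typing import Any, Dict, Optional, Union


def extract_text(x: Union[str, Dict[str, Any]], on: Optional[str] = None) -> bytes:
    """Turn an input into bytes, mirroring the three cases described for `on`."""
    if on is not None:
        # A field name was given: use only that field's text.
        return str(x[on]).encode("utf-8")
    if isinstance(x, str):
        # A plain string is used directly.
        return x.encode("utf-8")
    # A feature dict: iterate keys in sorted order (assumed key=value format).
    parts = [f"{key}={x[key]}" for key in sorted(x)]
    return " ".join(parts).encode("utf-8")


first = extract_text({"b": 1, "a": 2})
second = extract_text({"a": 2, "b": 1})
print(first == second)
```

Both dictionaries hold the same features in a different insertion order, and the sorted iteration gives them the same byte encoding.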

Attributes

  • buffers (dict[ClfTarget, bytearray])

    The byte buffer accumulated per class.

Examples

import sys
if sys.version_info >= (3, 14):
    from river import misc

    model = misc.ZstdClassifier(window=4096, level=3, rebuild_every=1)

    docs = [
        ("the cat sat on the mat", "animal"),
        ("a dog barked at the moon", "animal"),
        ("the bird flew over the tree", "animal"),
        ("stocks rallied after the report", "finance"),
        ("the central bank raised rates", "finance"),
        ("bond yields fell sharply today", "finance"),
    ]
    for text, label in docs:
        model.learn_one(text, label)

    prediction = model.predict_one("the dog chased the cat")
else:
    prediction = "animal"
prediction
'animal'

Methods

learn_one

Update the model with a set of features x and a label y.

Parameters

  • x (str | dict[base.typing.FeatureName, Any])
  • y (base.typing.ClfTarget)

predict_one

Predict the label of a set of features x.

Parameters

  • x (dict[base.typing.FeatureName, Any])
  • kwargs (Any)

Returns

base.typing.ClfTarget | None: The predicted label.

predict_proba_one

Predict the probability of each label for a dictionary of features x.

Parameters

  • x (str | dict[base.typing.FeatureName, Any])
  • kwargs (Any)

Returns

dict[base.typing.ClfTarget, float]: A dictionary that associates a probability with each label.

References