STREAMKMeans

STREAMKMeans is an alternative version of STREAMLSEARCH, the algorithm proposed by O'Callaghan et al. [1], in which the k-medians step based on LSEARCH is replaced by the classical KMeans algorithm.

However, instead of using the traditional KMeans, which requires a complete re-clustering every time the temporary chunk of data points is full, the implementation of this algorithm in River uses incremental KMeans. This allows the algorithm to update its centers without re-initialization, saving a substantial amount of computing resources.
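
The incremental KMeans used internally updates its centers one point at a time, which is what makes the chunked scheme cheap. A minimal sketch, assuming River's cluster.KMeans API (the parameter values below are arbitrary):

>>> from river import cluster

>>> incremental_km = cluster.KMeans(n_clusters=2, halflife=0.5, sigma=1.5, seed=0)

>>> # Each call to learn_one nudges the closest center towards the new point,
>>> # so no batch re-clustering is ever needed.
>>> for x in [{0: 1.0, 1: 0.5}, {0: 4.0, 1: 3.0}, {0: 1.0, 1: 1.0}]:
...     incremental_km.learn_one(x)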

The algorithm works as follows. At initialization, an incremental KMeans algorithm is created with the required number of clusters. Then, for each new point p (a simplified sketch of this chunking logic follows the list):

  • If the temporary chunk is smaller than the maximum allowed size, add the new point to it.

  • When the chunk reaches its maximum allowed size:

    • A new incremental KMeans algorithm is initialized and run over all points in the temporary chunk. Its resulting centers are then passed to the originally initialized KMeans in order to update that algorithm's centers.

    • All points are removed from the temporary chunk so that new points can be accumulated.

  • When a prediction request arrives, the centers used are exactly those of the original KMeans at the time of the request.
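
The whole procedure can be condensed into a short sketch. This is not River's actual implementation; the class name, attribute names and structure below are hypothetical, and only the documented cluster.KMeans calls (learn_one, predict_one, centers) are assumed:

from river import cluster

class ChunkedKMeansSketch:
    """Hypothetical sketch of the chunk-based update behind STREAMKMeans."""

    def __init__(self, chunk_size=10, n_clusters=2, **kwargs):
        self.chunk_size = chunk_size
        self.n_clusters = n_clusters
        self.kwargs = kwargs
        # The "original" incremental KMeans that holds the final centers.
        self.kmeans = cluster.KMeans(n_clusters=n_clusters, **kwargs)
        self._chunk = []

    def learn_one(self, x):
        self._chunk.append(x)
        if len(self._chunk) == self.chunk_size:
            # Cluster the full chunk with a fresh incremental KMeans ...
            chunk_km = cluster.KMeans(n_clusters=self.n_clusters, **self.kwargs)
            for point in self._chunk:
                chunk_km.learn_one(point)
            # ... then feed its centers to the original KMeans to update it.
            for center in chunk_km.centers.values():
                self.kmeans.learn_one(center)
            self._chunk = []  # empty the chunk so new points can be added
        return self

    def predict_one(self, x):
        # Predictions always use the original KMeans' current centers.
        return self.kmeans.predict_one(x)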

Parameters

  • chunk_size – defaults to 10

    Maximum size allowed for the temporary data chunk.

  • n_clusters – defaults to 2

    Number of clusters generated by the algorithm.

  • kwargs

    Other parameters passed to the incremental KMeans (cluster.KMeans).

Attributes

  • centers

    Cluster centers obtained by running the incremental KMeans algorithm over the centers of each chunk.
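
For instance, after fitting the estimator from the Examples section below, the fitted centers can be read directly from this attribute (a minimal illustration; the centers' values depend on the data and parameters, so only their count is checked):

>>> len(streamkmeans.centers)
2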

Examples

>>> from river import cluster
>>> from river import stream

>>> X = [
...     [1, 0.5], [1, 0.625], [1, 0.75], [1, 1.125], [1, 1.5], [1, 1.75],
...     [4, 1.5], [4, 2.25], [4, 2.5], [4, 3], [4, 3.25], [4, 3.5]
... ]

>>> streamkmeans = cluster.STREAMKMeans(chunk_size=3, n_clusters=2, halflife=0.5, sigma=1.5, seed=0)

>>> for x, _ in stream.iter_array(X):
...     streamkmeans = streamkmeans.learn_one(x)

>>> streamkmeans.predict_one({0: 1, 1: 0})
0

>>> streamkmeans.predict_one({0: 5, 1: 2})
1

Methods

clone

Return a fresh estimator with the same parameters.

The clone has the same parameters but has not been updated with any data. This works by looking at the parameters from the class signature. Each parameter is either recursively cloned if it is a River class, or deep-copied via copy.deepcopy if not. If the calling object is stochastic (i.e. it accepts a seed parameter) and has not been seeded, then the clone will not be idempotent. Indeed, this method's purpose is simply to return a new instance with the same input parameters.
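
For example, cloning the estimator fitted in the Examples section gives a fresh, unfitted instance with the same constructor arguments (a minimal illustration, assuming the hyperparameters are exposed as attributes of the same name):

>>> fresh = streamkmeans.clone()
>>> fresh.chunk_size, fresh.n_clusters
(3, 2)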

learn_one

Update the model with a set of features x.

Parameters

  • x (dict)
  • sample_weight (int) – defaults to None

Returns

Clusterer: self

predict_one

Predicts the cluster number for a set of features x.

Parameters

  • x (dict)
  • sample_weight – defaults to None

Returns

int: A cluster number.

References


  1. O'Callaghan et al. (2002). Streaming-data algorithms for high-quality clustering. In Proceedings of the 18th International Conference on Data Engineering, Feb 26 - March 1, San Jose, CA, USA. DOI: 10.1109/ICDE.2002.994785.