Skip to content

STREAMKMeans

STREAMKMeans

STREAMKMeans is an alternative version of the original algorithm STREAMLSEARCH proposed by O'Callaghan et al. 1, by replacing the k-medians using LSEARCH by the k-means algorithm.

However, instead of using the traditional k-means, which requires a total reclustering each time the temporary chunk of data points is full, the implementation of this algorithm uses an increamental k-means.

At first, the cluster centers are initialized with a KMeans instance. For a new point p:

  • If the size of chunk is less than the maximum size allowed, add the new point to the temporary chunk.

  • When the size of chunk reaches the maximum value size allowed

    • A new incremental KMeans instance is created. The latter will process all points in the temporary chunk. The centers of this new instance then become the new centers.

    • All points are deleted from the temporary chunk so that new points can be added.

  • When a prediction request arrives, the centers of the algorithm will be exactly the same as the centers of the original KMeans at the time of retrieval.

Parameters

  • chunk_size – defaults to 10

    Maximum size allowed for the temporary data chunk.

  • n_clusters – defaults to 2

    Number of clusters generated by the algorithm.

  • kwargs

    Other parameters passed to the incremental kmeans at cluster.KMeans.

Attributes

  • centers

    Cluster centers generated from running the incremental KMeans algorithm through centers of each chunk.

Examples

>>> from river import cluster
>>> from river import stream

>>> X = [
...     [1, 0.5], [1, 0.625], [1, 0.75], [1, 1.125], [1, 1.5], [1, 1.75],
...     [4, 1.5], [4, 2.25], [4, 2.5], [4, 3], [4, 3.25], [4, 3.5]
... ]

>>> streamkmeans = cluster.STREAMKMeans(chunk_size=3, n_clusters=2, halflife=0.5, sigma=1.5, seed=0)

>>> for x, _ in stream.iter_array(X):
...     streamkmeans = streamkmeans.learn_one(x)

>>> streamkmeans.predict_one({0: 1, 1: 0})
0

>>> streamkmeans.predict_one({0: 5, 1: 2})
1

Methods

learn_one

Update the model with a set of features x.

Parameters

  • x (dict)
  • sample_weight – defaults to None

Returns

Clusterer: self

predict_one

Predicts the cluster number for a set of features x.

Parameters

  • x (dict)
  • sample_weight – defaults to None

Returns

int: A cluster number.

References


  1. O'Callaghan et al. (2002). Streaming-data algorithms for high-quality clustering. In Proceedings 18th International Conference on Data Engineering, Feb 26 - March 1, San Jose, CA, USA. DOI: 10.1109/ICDE.2002.994785.