# STREAMKMeans

STREAMKMeans is a variant of the STREAMLSEARCH algorithm proposed by O'Callaghan et al. [1], in which the k-medians step based on LSEARCH is replaced by the k-means algorithm.

However, instead of the traditional k-means, which requires a full reclustering each time the temporary chunk of data points is full, this implementation uses an incremental k-means.
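To illustrate the difference from a full reclustering, here is a minimal sketch of one incremental step. It assumes an update rule in which only the nearest center moves a fixed fraction toward the new point. The names `nearest`, `incremental_update`, and `rate` are hypothetical and are not part of the library's API.

```python
import math


def nearest(centers, x):
    """Return the index of the center closest to x (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(centers[i], x))


def incremental_update(centers, x, rate=0.5):
    """Move only the nearest center a fraction `rate` of the way toward x.

    Assumed update rule for illustration; no other center is touched and
    no previously seen point is revisited.
    """
    i = nearest(centers, x)
    centers[i] = [c + rate * (v - c) for c, v in zip(centers[i], x)]
    return i


centers = [[0.0, 0.0], [4.0, 4.0]]
moved = incremental_update(centers, [1.0, 1.0], rate=0.5)
print(moved, centers)
```

Each call costs one pass over the centers, whereas a batch k-means would have to iterate over all stored points again.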

At first, the cluster centers are initialized with a KMeans instance. For a new point p:

• If the size of the chunk is less than the maximum size allowed, the new point is added to the temporary chunk.

• When the size of the chunk reaches the maximum size allowed:

    • A new incremental KMeans instance is created. It processes all points in the temporary chunk, and its centers then become the new centers.

    • All points are deleted from the temporary chunk so that new points can be added.

• When a prediction request arrives, the centers of the algorithm are exactly the centers of the original KMeans instance at the time of retrieval.
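The steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's implementation: `TinyIncrementalKMeans` and `ChunkedClusterer` are hypothetical classes, points are plain lists rather than dicts, and the `dim`, `rate`, and `seed` parameters are invented for the sketch.

```python
import math
import random


class TinyIncrementalKMeans:
    """Hypothetical stand-in for an incremental k-means model."""

    def __init__(self, n_clusters, dim, rate=0.5, seed=0):
        rng = random.Random(seed)
        self.rate = rate
        # Random starting centers; a real model may initialize differently.
        self.centers = [[rng.gauss(0, 1) for _ in range(dim)]
                        for _ in range(n_clusters)]

    def predict_one(self, x):
        # Index of the nearest center.
        return min(range(len(self.centers)),
                   key=lambda i: math.dist(self.centers[i], x))

    def learn_one(self, x):
        # Move the nearest center a fraction `rate` toward x.
        i = self.predict_one(x)
        self.centers[i] = [c + self.rate * (v - c)
                           for c, v in zip(self.centers[i], x)]
        return self


class ChunkedClusterer:
    """Sketch of the chunk handling described in the list above."""

    def __init__(self, chunk_size=10, n_clusters=2, dim=2, **kwargs):
        self.chunk_size = chunk_size
        self.n_clusters = n_clusters
        self.dim = dim
        self.kwargs = kwargs
        self._chunk = []
        # Initial centers come from a freshly created model.
        self.centers = TinyIncrementalKMeans(n_clusters, dim, **kwargs).centers

    def learn_one(self, x):
        self._chunk.append(x)
        if len(self._chunk) == self.chunk_size:
            # Chunk is full: a new model processes every buffered point.
            model = TinyIncrementalKMeans(self.n_clusters, self.dim, **self.kwargs)
            for point in self._chunk:
                model.learn_one(point)
            self.centers = model.centers  # its centers become the new centers
            self._chunk = []              # empty the chunk for new points
        return self

    def predict_one(self, x):
        # Uses whatever centers are current at the time of the request.
        return min(range(len(self.centers)),
                   key=lambda i: math.dist(self.centers[i], x))


model = ChunkedClusterer(chunk_size=3, n_clusters=2, dim=2, rate=0.5, seed=0)
for x in [[1, 0.5], [1, 0.625], [1, 0.75], [4, 1.5]]:
    model.learn_one(x)
print(len(model._chunk), model.predict_one([1, 0.5]))
```

In this sketch the centers change only when a chunk fills up, so predictions made between two chunk boundaries all use the same centers.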

## Parameters

• chunk_size – defaults to 10

Maximum size allowed for the temporary data chunk.

• n_clusters – defaults to 2

Number of clusters generated by the algorithm.

• kwargs

Other parameters passed to the incremental k-means implementation, cluster.KMeans.

## Attributes

• centers

Cluster centers obtained by running the incremental KMeans algorithm over the centers of each chunk.

## Examples

>>> from river import cluster
>>> from river import stream

>>> X = [
...     [1, 0.5], [1, 0.625], [1, 0.75], [1, 1.125], [1, 1.5], [1, 1.75],
...     [4, 1.5], [4, 2.25], [4, 2.5], [4, 3], [4, 3.25], [4, 3.5]
... ]

>>> streamkmeans = cluster.STREAMKMeans(chunk_size=3, n_clusters=2, halflife=0.5, sigma=1.5, seed=0)

>>> for x, _ in stream.iter_array(X):
...     streamkmeans = streamkmeans.learn_one(x)

>>> streamkmeans.predict_one({0: 1, 1: 0})
0

>>> streamkmeans.predict_one({0: 5, 1: 2})
1


## Methods

learn_one

Update the model with a set of features x.

Parameters

• x (dict)
• sample_weight – defaults to None

Returns

Clusterer: self

predict_one

Predicts the cluster number for a set of features x.

Parameters

• x (dict)
• sample_weight – defaults to None

Returns

int: A cluster number.

## References

1. O'Callaghan et al. (2002). Streaming-data algorithms for high-quality clustering. In Proceedings 18th International Conference on Data Engineering, Feb 26 - March 1, San Jose, CA, USA. DOI: 10.1109/ICDE.2002.994785.