STREAMKMeans¶

STREAMKMeans is a variant of the STREAMLSEARCH algorithm proposed by O'Callaghan et al. ¹ that replaces the LSEARCH-based k-medians step with the k-means algorithm.

However, traditional k-means would require a full reclustering each time the temporary chunk of data points fills up, so this implementation uses an incremental k-means instead.
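As a rough illustration of what "incremental" means here: an online k-means update typically moves only the center closest to the incoming point, rather than recomputing every center from all the data. The rule below is a sketch under assumptions, not a statement of the exact implementation. The symbols are introduced for illustration only: C is the current set of centers, p the incoming point, c* the center closest to p, and η a step size in (0, 1], a role played by the halflife parameter of cluster.KMeans.

```latex
c^{*} = \arg\min_{c \in C} \lVert p - c \rVert_2,
\qquad
c^{*} \leftarrow c^{*} + \eta \, (p - c^{*})
```

With η = 0.5, for example, the closest center moves halfway toward each new point, so each point costs one distance computation per center instead of a full pass over the data.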

At first, the cluster centers are initialized with a KMeans instance. For each new point p:

- If the size of the temporary chunk is less than the maximum size allowed, p is added to the chunk.
- When the size of the chunk reaches the maximum size allowed:
    - A new incremental KMeans instance is created and processes all points in the temporary chunk. The centers of this new instance then become the new centers.
    - All points are deleted from the temporary chunk so that new points can be added.

When a prediction request arrives, the centers used by the algorithm are exactly the centers of the original KMeans instance at the time of retrieval.
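The steps above can be sketched in plain Python. This is a minimal illustration of the chunk-buffering logic only, not River's implementation: the names `TinyIncrementalKMeans`, `ChunkedKMeansSketch` and `squared_distance` are hypothetical, and seeding the centers from the first points of each chunk is an assumption made to keep the example short and deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature dicts."""
    keys = set(a) | set(b)
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys)


class TinyIncrementalKMeans:
    """Hypothetical stand-in for an incremental k-means (not cluster.KMeans)."""

    def __init__(self, n_clusters=2, halflife=0.5):
        self.n_clusters = n_clusters
        self.halflife = halflife
        self.centers = {}  # cluster id -> feature dict

    def learn_one(self, x):
        # Assumption: the first n_clusters points seed the centers.
        if len(self.centers) < self.n_clusters:
            self.centers[len(self.centers)] = dict(x)
            return
        # Move the closest center a fraction `halflife` of the way toward x.
        closest = min(self.centers, key=lambda c: squared_distance(self.centers[c], x))
        center = self.centers[closest]
        for key, value in x.items():
            old = center.get(key, 0.0)
            center[key] = old + self.halflife * (value - old)


class ChunkedKMeansSketch:
    """Buffers points in a chunk and refreshes the centers when it is full."""

    def __init__(self, chunk_size=10, n_clusters=2, **kwargs):
        self.chunk_size = chunk_size
        self.n_clusters = n_clusters
        self.kwargs = kwargs  # forwarded to the incremental k-means
        self._chunk = []
        self.centers = {}

    def learn_one(self, x):
        # Add the new point to the temporary chunk.
        self._chunk.append(x)
        if len(self._chunk) < self.chunk_size:
            return
        # Chunk is full: a fresh incremental k-means processes all its points.
        model = TinyIncrementalKMeans(n_clusters=self.n_clusters, **self.kwargs)
        for point in self._chunk:
            model.learn_one(point)
        # Its centers become the new centers, and the chunk is emptied.
        self.centers = model.centers
        self._chunk = []

    def predict_one(self, x):
        # Assumption: at least one chunk has been processed, so centers exist.
        return min(self.centers, key=lambda c: squared_distance(self.centers[c], x))


points = [{0: 1, 1: 0.5}, {0: 1, 1: 0.75}, {0: 4, 1: 3.0}, {0: 4, 1: 3.5}]
model = ChunkedKMeansSketch(chunk_size=4, n_clusters=2, halflife=0.5)
for p in points:
    model.learn_one(p)
```

In this sketch the centers stay empty until the first chunk fills up, so `predict_one` is only meaningful after at least `chunk_size` points have been seen.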

Parameters¶

chunk_size

Default → 10

Maximum size allowed for the temporary data chunk.

n_clusters

Default → 2

Number of clusters generated by the algorithm.

kwargs

Other parameters passed to the underlying incremental k-means, cluster.KMeans.

Attributes¶

centers

Cluster centers obtained by running the incremental KMeans algorithm on the centers of each chunk.

Examples¶

from river import cluster
from river import stream

X = [
    [1, 0.5], [1, 0.625], [1, 0.75], [1, 1.125], [1, 1.5], [1, 1.75],
    [4, 1.5], [4, 2.25], [4, 2.5], [4, 3], [4, 3.25], [4, 3.5]
]

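# halflife, sigma and seed are not STREAMKMeans parameters; they are forwarded to cluster.KMeans
streamkmeans = cluster.STREAMKMeans(chunk_size=3, n_clusters=2, halflife=0.5, sigma=1.5, seed=0)

# Feed the points one at a time; the centers are refreshed each time a chunk of 3 points fills up
for x, _ in stream.iter_array(X):
    streamkmeans.learn_one(x)

# Ask which cluster each of two unseen points is assigned to
streamkmeans.predict_one({0: 1, 1: 0})

streamkmeans.predict_one({0: 5, 1: 2})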

Methods¶

learn_one

Update the model with a set of features x.

Parameters

x — 'dict'
w — defaults to None

predict_one

Predicts the cluster number for a set of features x.

Parameters

x — 'dict'
w — defaults to None

Returns

int: A cluster number.

¹ O'Callaghan et al. (2002). Streaming-data algorithms for high-quality clustering. In Proceedings 18th International Conference on Data Engineering, Feb 26 - March 1, San Jose, CA, USA. DOI: 10.1109/ICDE.2002.994785.