STREAMKMeans
STREAMKMeans is a variant of the STREAMLSEARCH algorithm proposed by O'Callaghan et al. [1], in which the k-medians step based on LSEARCH is replaced by the k-means algorithm.
Traditional k-means would require a full reclustering each time the temporary chunk of data points fills up; this implementation uses an incremental k-means instead.
The cluster centers are first initialized with a KMeans instance. Then, for each new point p:
- If the chunk holds fewer points than the maximum size allowed, add the new point to the temporary chunk.
- When the chunk reaches the maximum size allowed:
    - A new incremental KMeans instance is created and processes all points in the temporary chunk. The centers of this new instance then become the new centers.
    - All points are deleted from the temporary chunk so that new points can be added.
- When a prediction request arrives, the centers of the algorithm are exactly those of the original KMeans at the time of retrieval.
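The steps above, read together with the description of centers under Attributes, can be sketched in plain Python. This is a minimal illustration and not River's source: TinyOnlineKMeans and ChunkedKMeansSketch are hypothetical names, and the stand-in k-means seeds its centers from the first points it sees and uses a running-mean update, both assumptions made for brevity. The sketch also assumes that the centers of each chunk-level model are fed into one long-lived model, which is one possible reading of the description.

```python
class TinyOnlineKMeans:
    """Hypothetical stand-in for an incremental k-means; not River's KMeans."""

    def __init__(self, n_clusters):
        self.n_clusters = n_clusters
        self.centers = {}  # cluster id -> {feature: value}
        self.counts = {}   # cluster id -> number of points absorbed

    def predict_one(self, x):
        # Nearest center by squared Euclidean distance.
        def sq_dist(center):
            keys = set(x) | set(center)
            return sum((x.get(k, 0.0) - center.get(k, 0.0)) ** 2 for k in keys)

        return min(self.centers, key=lambda i: sq_dist(self.centers[i]))

    def learn_one(self, x):
        # Assumption: the first n_clusters points seed the centers.
        if len(self.centers) < self.n_clusters:
            i = len(self.centers)
            self.centers[i] = dict(x)
            self.counts[i] = 1
            return self
        # Otherwise move the nearest center toward x (running mean).
        i = self.predict_one(x)
        self.counts[i] += 1
        center = self.centers[i]
        for k, v in x.items():
            old = center.get(k, 0.0)
            center[k] = old + (v - old) / self.counts[i]
        return self


class ChunkedKMeansSketch:
    """Chunk-buffering logic from the list above, under the stated assumptions."""

    def __init__(self, chunk_size=10, n_clusters=2):
        self.chunk_size = chunk_size
        self.n_clusters = n_clusters
        self._chunk = []                             # temporary chunk
        self._kmeans = TinyOnlineKMeans(n_clusters)  # long-lived model

    @property
    def centers(self):
        return self._kmeans.centers

    def learn_one(self, x):
        self._chunk.append(x)
        if len(self._chunk) == self.chunk_size:
            # A fresh incremental model processes every point in the chunk.
            chunk_model = TinyOnlineKMeans(self.n_clusters)
            for point in self._chunk:
                chunk_model.learn_one(point)
            # Its centers are then fed into the long-lived model.
            for center in chunk_model.centers.values():
                self._kmeans.learn_one(center)
            # Empty the chunk so that new points can be added.
            self._chunk.clear()
        return self

    def predict_one(self, x):
        # Predictions rely on whatever centers exist at the time of the call.
        return self._kmeans.predict_one(x)


model = ChunkedKMeansSketch(chunk_size=3, n_clusters=2)
points = [(1, 1), (4, 3), (1, 1.5), (4, 2.5), (1, 0.5), (4, 3.5)]
for a, b in points:
    model.learn_one({0: a, 1: b})
print(model.predict_one({0: 1, 1: 0}), model.predict_one({0: 5, 1: 2}))  # prints: 0 1
```

The centers change only when a chunk fills up, so a prediction made between two flushes uses the centers from the most recent flush. The toy data is interleaved on purpose: with first-points seeding, a stream sorted by cluster would seed both centers inside the same cluster.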
Parameters
- chunk_size – defaults to 10
  Maximum size allowed for the temporary data chunk.
- n_clusters – defaults to 2
  Number of clusters generated by the algorithm.
- kwargs
  Other parameters passed to the incremental k-means, cluster.KMeans (for instance halflife, sigma, and seed in the Examples section).
Attributes
- centers
  Cluster centers generated by running the incremental KMeans algorithm over the centers of each chunk.
Examples
>>> from river import cluster
>>> from river import stream
>>> X = [
...     [1, 0.5], [1, 0.625], [1, 0.75], [1, 1.125], [1, 1.5], [1, 1.75],
...     [4, 1.5], [4, 2.25], [4, 2.5], [4, 3], [4, 3.25], [4, 3.5]
... ]
>>> streamkmeans = cluster.STREAMKMeans(chunk_size=3, n_clusters=2, halflife=0.5, sigma=1.5, seed=0)
>>> for x, _ in stream.iter_array(X):
...     streamkmeans = streamkmeans.learn_one(x)
>>> streamkmeans.predict_one({0: 1, 1: 0})
0
>>> streamkmeans.predict_one({0: 5, 1: 2})
1
Methods
learn_one
Update the model with a set of features x.
Parameters
- x (dict)
- sample_weight – defaults to None
Returns
Clusterer: self
predict_one
Predicts the cluster number for a set of features x.
Parameters
- x (dict)
- sample_weight – defaults to None
Returns
int: A cluster number.
References
- O'Callaghan et al. (2002). Streaming-data algorithms for high-quality clustering. In Proceedings 18th International Conference on Data Engineering, Feb 26 - March 1, San Jose, CA, USA. DOI: 10.1109/ICDE.2002.994785.