STREAMKMeans¶
STREAMKMeans is a variant of the STREAMLSEARCH algorithm proposed by O'Callaghan et al. [1], in which the k-medians step based on LSEARCH is replaced by the k-means algorithm.
However, instead of the traditional k-means, which requires a full reclustering each time the temporary chunk of data points is full, this implementation uses an incremental k-means.
At first, the cluster centers are initialized with a KMeans instance. For a new point p:
- If the size of the chunk is less than the maximum size allowed, add the new point to the temporary chunk.
- When the size of the chunk reaches the maximum size allowed:
    - A new incremental KMeans instance is created. It processes all points in the temporary chunk. The centers of this new instance then become the new centers.
    - All points are deleted from the temporary chunk so that new points can be added.
When a prediction request arrives, the centers used by the algorithm are exactly the centers of the original KMeans instance at the time of retrieval.
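The procedure can be sketched in plain Python. This is a minimal illustration under stated assumptions, not river's implementation: `TinyIncrementalKMeans`, `ChunkedKMeansSketch`, and the `learning_rate` parameter are hypothetical names, the stand-in k-means seeds its centers with the first points it sees, and distances are Euclidean. The sketch follows the reading given under Attributes, where the centers of each full chunk are fed into a long-lived incremental k-means; that reading is itself an assumption.

```python
class TinyIncrementalKMeans:
    """Hypothetical stand-in for an incremental k-means (not river's cluster.KMeans)."""

    def __init__(self, n_clusters=2, learning_rate=0.5):
        self.n_clusters = n_clusters
        self.learning_rate = learning_rate  # assumed step size toward each new point
        self.centers = {}  # cluster index -> {feature: value}

    def _sq_dist(self, center, x):
        return sum((center[k] - x[k]) ** 2 for k in x)

    def predict_one(self, x):
        # Index of the closest center (squared Euclidean distance).
        return min(self.centers, key=lambda i: self._sq_dist(self.centers[i], x))

    def learn_one(self, x):
        if len(self.centers) < self.n_clusters:
            # Assumption: seed the centers with the first n_clusters points.
            self.centers[len(self.centers)] = dict(x)
            return
        i = self.predict_one(x)
        c = self.centers[i]
        # Move the closest center part of the way toward x.
        self.centers[i] = {k: c[k] + self.learning_rate * (x[k] - c[k]) for k in c}


class ChunkedKMeansSketch:
    """Buffers points in a chunk and summarises each full chunk with a fresh model."""

    def __init__(self, chunk_size=10, n_clusters=2, **kwargs):
        self.chunk_size = chunk_size
        self.n_clusters = n_clusters
        self.kwargs = kwargs  # forwarded to every TinyIncrementalKMeans
        self.chunk = []
        self.summary = TinyIncrementalKMeans(n_clusters, **kwargs)  # long-lived model

    @property
    def centers(self):
        return self.summary.centers

    def learn_one(self, x):
        self.chunk.append(x)
        if len(self.chunk) < self.chunk_size:
            return  # chunk not full yet: only buffer the point
        fresh = TinyIncrementalKMeans(self.n_clusters, **self.kwargs)
        for point in self.chunk:
            fresh.learn_one(point)
        for center in fresh.centers.values():
            self.summary.learn_one(center)  # fold the chunk's centers into the summary
        self.chunk.clear()  # make room for new points

    def predict_one(self, x):
        return self.summary.predict_one(x)


X = [[1, 0.5], [1, 0.625], [1, 0.75], [1, 1.125], [1, 1.5], [1, 1.75],
     [4, 1.5], [4, 2.25], [4, 2.5], [4, 3], [4, 3.25], [4, 3.5]]
model = ChunkedKMeansSketch(chunk_size=3, n_clusters=2, learning_rate=0.5)
for row in X:
    model.learn_one(dict(enumerate(row)))  # features as {index: value}
```

On this toy data the sketch assigns `{0: 1, 1: 0}` to cluster 0 and `{0: 5, 1: 2}` to cluster 1. Calling `predict_one` before the first chunk has been processed would fail in this sketch, because no centers exist yet.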
Parameters¶
- chunk_size
  Default → 10
  Maximum size allowed for the temporary data chunk.
- n_clusters
  Default → 2
  Number of clusters generated by the algorithm.
- kwargs
  Other parameters passed to the incremental k-means, cluster.KMeans.
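To make the effect of chunk_size concrete, the arithmetic below counts how many full chunks are processed for a stream of a given length, and how many points are still waiting in the temporary chunk. It only restates the buffering rule described above; the helper name `chunk_stats` is hypothetical and is not part of river.

```python
def chunk_stats(n_points, chunk_size):
    """Return (number of full chunks processed, points still buffered)."""
    return divmod(n_points, chunk_size)


print(chunk_stats(12, 3))   # (4, 0): four full chunks, nothing left in the buffer
print(chunk_stats(12, 10))  # (1, 2): one full chunk, two points still buffered
```

Under this rule, a smaller chunk_size means the centers are refreshed more often, while points left in the buffer do not influence the centers until their chunk fills up.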
Attributes¶
- centers
  Cluster centers obtained by running the incremental KMeans algorithm over the centers of each chunk.
Examples¶
>>> from river import cluster
>>> from river import stream

>>> X = [
...     [1, 0.5], [1, 0.625], [1, 0.75], [1, 1.125], [1, 1.5], [1, 1.75],
...     [4, 1.5], [4, 2.25], [4, 2.5], [4, 3], [4, 3.25], [4, 3.5]
... ]

>>> streamkmeans = cluster.STREAMKMeans(chunk_size=3, n_clusters=2, halflife=0.5, sigma=1.5, seed=0)

>>> for x, _ in stream.iter_array(X):
...     streamkmeans.learn_one(x)

>>> streamkmeans.predict_one({0: 1, 1: 0})
0

>>> streamkmeans.predict_one({0: 5, 1: 2})
1
Methods¶
learn_one
Update the model with a set of features x.
Parameters
- x — 'dict'
- w — defaults to None
predict_one
Predicts the cluster number for a set of features x.
Parameters
- x — 'dict'
- w — defaults to None
Returns
int: A cluster number.
[1] O'Callaghan et al. (2002). Streaming-data algorithms for high-quality clustering. In Proceedings of the 18th International Conference on Data Engineering, Feb 26 - March 1, San Jose, CA, USA. DOI: 10.1109/ICDE.2002.994785.