STREAMKMeans¶
STREAMKMeans is an alternative version of the original STREAMLSEARCH algorithm proposed by O'Callaghan et al. 1, obtained by replacing the k-medians step based on LSEARCH with the classical KMeans algorithm.
However, instead of using the traditional KMeans, which requires a complete reclustering each time the temporary chunk of data points is full, the implementation of this algorithm in River uses the incremental KMeans. This allows the algorithm to update KMeans without re-initialization, saving a substantial amount of computing resources.
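For reference, the incremental cluster.KMeans updates its centers one observation at a time, so it can absorb new points as they arrive. A minimal sketch, assuming illustrative values for halflife, sigma, and seed (they are not defaults):

>>> from river import cluster

>>> incremental = cluster.KMeans(n_clusters=2, halflife=0.5, sigma=1.5, seed=0)
>>> for x in [{0: 1, 1: 0.5}, {0: 1, 1: 0.75}, {0: 4, 1: 2.5}, {0: 4, 1: 3.0}]:
...     incremental = incremental.learn_one(x)  # centers are adjusted after every single point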
The algorithm is constructed as follows (a simplified sketch of these steps is given after the list). To begin, an incremental KMeans algorithm is initialized with the required number of centers. Then, for every new point p:

- If the size of the chunk is less than the maximum size allowed, the new point is added to the temporary chunk.

- When the size of the chunk reaches the maximum size allowed:

    - A new incremental KMeans algorithm is initialized and run through all points in the temporary chunk. The centers of this new algorithm are then passed through the originally initialized KMeans to update its centers.

    - All points are deleted from the temporary chunk so that new points can be added later.

- When a prediction request arrives, the centers of the algorithm are exactly the same as the centers of the original KMeans at the time of retrieval.
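The following is a minimal sketch of this chunk-based update written directly against cluster.KMeans; it is not the library's internal code, and the chunk_size, halflife, sigma, and seed values are purely illustrative:

>>> from river import cluster

>>> chunk_size, n_clusters = 3, 2
>>> base = cluster.KMeans(n_clusters=n_clusters, halflife=0.5, sigma=1.5, seed=0)
>>> chunk = []

>>> points = [
...     {0: 1, 1: 0.5}, {0: 1, 1: 0.75}, {0: 1, 1: 1.5},
...     {0: 4, 1: 2.5}, {0: 4, 1: 3.0}, {0: 4, 1: 3.25},
... ]

>>> for p in points:
...     chunk.append(p)                      # step 1: buffer the new point
...     if len(chunk) == chunk_size:         # step 2: the chunk is full
...         temp = cluster.KMeans(n_clusters=n_clusters, halflife=0.5, sigma=1.5, seed=0)
...         for q in chunk:
...             temp = temp.learn_one(q)     # fresh incremental KMeans over the chunk
...         for center in temp.centers.values():
...             base = base.learn_one(center)  # its centers update the long-lived KMeans
...         chunk = []                       # empty the chunk for the next points

>>> label = base.predict_one({0: 1, 1: 0})  # step 3: predictions use the centers of `base`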
Parameters¶
- chunk_size – defaults to 10
  Maximum size allowed for the temporary data chunk.
- n_clusters – defaults to 2
  Number of clusters generated by the algorithm.
- kwargs
  Other parameters passed to the incremental KMeans (cluster.KMeans).
Attributes¶
- centers
  Cluster centers generated from running the incremental KMeans algorithm through the centers of each chunk.
Examples¶
>>> from river import cluster
>>> from river import stream
>>> X = [
... [1, 0.5], [1, 0.625], [1, 0.75], [1, 1.125], [1, 1.5], [1, 1.75],
... [4, 1.5], [4, 2.25], [4, 2.5], [4, 3], [4, 3.25], [4, 3.5]
... ]
>>> streamkmeans = cluster.STREAMKMeans(chunk_size=3, n_clusters=2, halflife=0.5, sigma=1.5, seed=0)
>>> for x, _ in stream.iter_array(X):
... streamkmeans = streamkmeans.learn_one(x)
>>> streamkmeans.predict_one({0: 1, 1: 0})
0
>>> streamkmeans.predict_one({0: 5, 1: 2})
1
Methods¶
clone
Return a fresh estimator with the same parameters.
The clone has the same parameters but has not been updated with any data. This works by looking at the parameters from the class signature. Each parameter is either recursively cloned if it is a River class, or deep-copied via copy.deepcopy otherwise. If the calling object is stochastic (i.e. it accepts a seed parameter) and has not been seeded, then the clone will not be idempotent. Indeed, this method's purpose is simply to return a new instance with the same input parameters.
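A minimal sketch of how clone is typically used (the constructor arguments here are illustrative):

>>> from river import cluster

>>> model = cluster.STREAMKMeans(chunk_size=3, n_clusters=2, halflife=0.5, sigma=1.5, seed=0)
>>> fresh = model.clone()  # same chunk_size, n_clusters, and kwargs, but no learned state
>>> fresh.n_clusters
2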
learn_one
Update the model with a set of features x.
Parameters
- x (dict)
- sample_weight (int) – defaults to None
Returns
Clusterer: self
predict_one
Predicts the cluster number for a set of features x.
Parameters
- x (dict)
- sample_weight – defaults to None
Returns
int: A cluster number.
References¶
- O'Callaghan et al. (2002). Streaming-data algorithms for high-quality clustering. In Proceedings of the 18th International Conference on Data Engineering, Feb 26 - March 1, San Jose, CA, USA. DOI: 10.1109/ICDE.2002.994785. ↩