CluStream¶
CluStream
The CluStream algorithm 1 maintains statistical information about the data using micro-clusters. These micro-clusters are temporal extensions of cluster feature vectors. The micro-clusters are stored at snapshots in time following a pyramidal pattern. This pattern allows to recall summary statistics from different time horizons.
Training with a new point p
is performed in two main tasks:
-
Determinate the closest micro-cluster to
p
. -
Check whether
p
fits (memory) into the closest micro-cluster:-
if
p
fits, put into micro-cluster -
if
p
does not fit, free some space to insert a new micro-cluster. This is done in two ways, delete an old micro-cluster or merge the two micro-clusters closest to each other.
-
This implementation is an improved version from the original algorithm. Instead of calculating the traditional cluster feature vector of the number of observations, linear sum and sum of squares of data points and time stamps, this implementation adopts the use of Welford's algorithm 2 to calculate the incremental variance, facilitated through stats.Var()
available within River
.
Since River does not support an actual "off-line" phase of the clustering algorithm (as data points are assumed to arrive continuously, one at a time), a time_gap
parameter is introduced. After each time_gap
, an incremental K-Means clustering algorithm will be initialized and applied on currently available micro-clusters to form the final solution, i.e. macro-clusters.
Parameters¶
-
n_macro_clusters (int) – defaults to
5
The number of clusters (k) for the k-means algorithm.
-
max_micro_clusters (int) – defaults to
100
The maximum number of micro-clusters to use.
-
micro_cluster_r_factor (int) – defaults to
2
Multiplier for the micro-cluster radius. When deciding to add a new data point to a micro-cluster, the maximum boundary is defined as a factor of the
micro_cluster_r_factor
of the RMS deviation of the data points in the micro-cluster from the centroid. -
time_window (int) – defaults to
1000
If the current time is
T
and the time window ish
, we only consider the data that arrived within the period(T-h,T)
. -
time_gap (int) – defaults to
100
An incremental k-means is applied on the current set of micro-clusters after each
time_gap
to form the final macro-cluster solution. -
seed (int) – defaults to
None
Random seed used for generating initial centroid positions.
-
kwargs
Other parameters passed to the incremental kmeans at
cluster.KMeans
.
Attributes¶
-
centers (dict)
Central positions of each cluster.
Examples¶
In the following example, max_micro_clusters
is set relatively low due to the
limited number of training points. Moreover, all points are learnt before any predictions are made.
The halflife
is set at 0.4, to show that you can pass cluster.KMeans
parameters via keyword arguments.
>>> from river import cluster
>>> from river import stream
>>> X = [
... [1, 2],
... [1, 4],
... [1, 0],
... [-4, 2],
... [-4, 4],
... [-4, 0],
... [5, 0],
... [5, 2],
... [5, 4]
... ]
>>> clustream = cluster.CluStream(
... n_macro_clusters=3,
... max_micro_clusters=5,
... time_gap=3,
... seed=0,
... halflife=0.4
... )
>>> for x, _ in stream.iter_array(X):
... clustream = clustream.learn_one(x)
>>> clustream.predict_one({0: 1, 1: 1})
1
>>> clustream.predict_one({0: -4, 1: 3})
2
>>> clustream.predict_one({0: 4, 1: 3.5})
0
Methods¶
learn_one
Update the model with a set of features x
.
Parameters
- x (dict)
- w – defaults to
1.0
Returns
Clusterer: self
predict_one
Predicts the cluster number for a set of features x
.
Parameters
- x (dict)
Returns
int: A cluster number.
References¶
-
Aggarwal, C.C., Philip, S.Y., Han, J. and Wang, J., 2003, A framework for clustering evolving data streams. In Proceedings 2003 VLDB conference (pp. 81-92). Morgan Kaufmann. ↩
-
Chan, T.F., Golub, G.H. and LeVeque, R.J., 1982. Updating formulae and a pairwise algorithm for computing sample variances. In COMPSTAT 1982 5th Symposium held at Toulouse 1982 (pp. 30-41). Physica, Heidelberg. https://doi.org/10.1007/978-3-642-51461-6_3. ↩