CluStream¶
CluStream
The CluStream algorithm ^{1} maintains statistical information about the data using microclusters. These microclusters are temporal extensions of cluster feature vectors. The microclusters are stored at snapshots in time following a pyramidal pattern. This pattern allows to recall summary statistics from different time horizons.
Training with a new point p
is performed in two main tasks:

Determinate the closest microcluster to
p
. 
Check whether
p
fits (memory) into the closest microcluster:
if
p
fits, put into microcluster 
if
p
does not fit, free some space to insert a new microcluster.
This is done in two ways, delete an old microcluster or merge the two microclusters closest to each other.

This implementation is an improved version from the original algorithm. Instead of calculating the traditional cluster feature vector of the number of observations, linear sum and sum of squares of data points and time stamps, this implementation adopts the use of Welford's algorithm ^{2} to calculate the incremental variance, facilitated through stats.Var
available within River.
Since River does not support an actual "offline" phase of the clustering algorithm (as data points are assumed to arrive continuously, one at a time), a time_gap
parameter is introduced. After each time_gap
, an incremental KMeans clustering algorithm will be initialized and applied on currently available microclusters to form the final solution, i.e. macroclusters.
Parameters¶

n_macro_clusters
Type → int
Default →
5
The number of clusters (k) for the kmeans algorithm.

max_micro_clusters
Type → int
Default →
100
The maximum number of microclusters to use.

micro_cluster_r_factor
Type → int
Default →
2
Multiplier for the microcluster radius. When deciding to add a new data point to a microcluster, the maximum boundary is defined as a factor of the
micro_cluster_r_factor
of the RMS deviation of the data points in the microcluster from the centroid. 
time_window
Type → int
Default →
1000
If the current time is
T
and the time window ish
, we only consider the data that arrived within the period(Th,T)
. 
time_gap
Type → int
Default →
100
An incremental kmeans is applied on the current set of microclusters after each
time_gap
to form the final macrocluster solution. 
seed
Type → int  None
Default →
None
Random seed used for generating initial centroid positions.

kwargs
Other parameters passed to the incremental kmeans at
cluster.KMeans
.
Attributes¶

centers (dict)
Central positions of each cluster.
Examples¶
In the following example, max_micro_clusters
is set relatively low due to the
limited number of training points. Moreover, all points are learnt before any predictions are made.
The halflife
is set at 0.4, to show that you can pass cluster.KMeans
parameters via keyword arguments.
from river import cluster
from river import stream
X = [
[1, 2],
[1, 4],
[1, 0],
[4, 2],
[4, 4],
[4, 0],
[5, 0],
[5, 2],
[5, 4]
]
clustream = cluster.CluStream(
n_macro_clusters=3,
max_micro_clusters=5,
time_gap=3,
seed=0,
halflife=0.4
)
for x, _ in stream.iter_array(X):
clustream.learn_one(x)
clustream.predict_one({0: 1, 1: 1})
1
clustream.predict_one({0: 4, 1: 3})
2
clustream.predict_one({0: 4, 1: 3.5})
0
Methods¶
learn_one
Update the model with a set of features x
.
Parameters
 x — 'dict'
 w — defaults to
1.0
predict_one
Predicts the cluster number for a set of features x
.
Parameters
 x — 'dict'
Returns
int: A cluster number.

Aggarwal, C.C., Philip, S.Y., Han, J. and Wang, J., 2003, A framework for clustering evolving data streams. In Proceedings 2003 VLDB conference (pp. 8192). Morgan Kaufmann. ↩

Chan, T.F., Golub, G.H. and LeVeque, R.J., 1982. Updating formulae and a pairwise algorithm for computing sample variances. In COMPSTAT 1982 5th Symposium held at Toulouse 1982 (pp. 3041). Physica, Heidelberg. https://doi.org/10.1007/9783642514616_3. ↩