CluStream¶
The CluStream algorithm [1] maintains statistical information about the data using micro-clusters. These micro-clusters are temporal extensions of cluster feature vectors. The micro-clusters are stored at snapshots in time following a pyramidal pattern. This pattern allows summary statistics to be recalled from different time horizons.
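As a rough illustration of the pyramidal pattern, a snapshot taken at time `t` can be assigned to frame orders according to divisibility by powers of a base. This is only a sketch: the function name `snapshot_orders`, the base `alpha=2`, and the `max_order` cap are illustrative assumptions, not part of this API.

```python
def snapshot_orders(t: int, alpha: int = 2, max_order: int = 10) -> list:
    """List the pyramidal frame orders i for which a snapshot at time t
    qualifies, i.e. those i with t divisible by alpha**i.

    In the original paper a snapshot is stored at the *highest* such order;
    listing all qualifying orders here makes the pattern easier to see.
    """
    return [i for i in range(max_order + 1) if t % alpha**i == 0]

# Snapshots at times divisible by higher powers of alpha are kept longer,
# so recent history is stored at fine granularity and older history coarsely.
```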
Training with a new point `p` is performed in two main tasks:

- Determine the closest micro-cluster to `p`.
- Check whether `p` fits (memory) into the closest micro-cluster:
    - If `p` fits, put it into the micro-cluster.
    - If `p` does not fit, free some space to insert a new micro-cluster. This is done in one of two ways: delete an old micro-cluster, or merge the two micro-clusters closest to each other.
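The two training tasks above can be sketched in plain Python. This is a simplified, illustrative model, not the actual implementation: the names `MicroCluster` and `learn_one`, the merge-only space-freeing strategy, and the omission of timestamps are all assumptions made for brevity.

```python
import math

class MicroCluster:
    """Cluster feature vector: count n, linear sum ls, squared sum ss."""
    def __init__(self, x):
        self.n = 1
        self.ls = list(x)
        self.ss = [v * v for v in x]

    def center(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        # RMS deviation of the cluster's points from its centroid
        var = sum(ss / self.n - (ls / self.n) ** 2
                  for ls, ss in zip(self.ls, self.ss))
        return math.sqrt(max(var, 0.0))

    def insert(self, x):
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v

    def merge(self, other):
        # CF vectors are additive, so merging is component-wise addition
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]

def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def learn_one(micro_clusters, x, r_factor=2.0, max_micro_clusters=100):
    """One simplified CluStream training step (no timestamps)."""
    closest = min(micro_clusters, key=lambda mc: dist(mc.center(), x))
    # note: a singleton micro-cluster has zero radius, so only an exact
    # duplicate fits; the real algorithm uses a heuristic boundary there
    if dist(closest.center(), x) <= r_factor * closest.radius():
        closest.insert(x)  # p fits: absorb it into the closest micro-cluster
        return
    if len(micro_clusters) >= max_micro_clusters:
        # free space by merging the two closest micro-clusters
        # (alternatively, the algorithm may delete an outdated one)
        i, j = min(
            ((i, j) for i in range(len(micro_clusters))
                    for j in range(i + 1, len(micro_clusters))),
            key=lambda p: dist(micro_clusters[p[0]].center(),
                               micro_clusters[p[1]].center()),
        )
        micro_clusters[i].merge(micro_clusters[j])
        del micro_clusters[j]
    micro_clusters.append(MicroCluster(x))  # start a new micro-cluster at p
```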
Parameters¶

- **seed** (*int*) – defaults to `None`
  Random seed used for generating initial centroid positions.
- **time_window** (*int*) – defaults to `1000`
  If the current time is `T` and the time window is `h`, we only consider the data that arrived within the period `(T - h, T)`.
- **max_micro_clusters** (*int*) – defaults to `100`
  The maximum number of micro-clusters to use.
- **micro_cluster_r_factor** (*int*) – defaults to `2`
  Multiplier for the micro-cluster radius. When deciding whether to add a new data point to a micro-cluster, the maximum boundary is defined as `micro_cluster_r_factor` times the RMS deviation of the micro-cluster's data points from the centroid.
- **n_macro_clusters** (*int*) – defaults to `5`
  The number of clusters (`k`) for the k-means algorithm.
- **kwargs**
  Other parameters passed to the incremental k-means in `cluster.KMeans`.
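To form the final `n_macro_clusters` macro clusters, k-means is run over the micro-cluster centers. The following self-contained sketch uses plain Lloyd's k-means for illustration only; river's actual `cluster.KMeans` is incremental, and the function below is an assumed stand-in, not its API.

```python
import random

def kmeans(points, k, n_iter=10, seed=0):
    """Plain Lloyd's k-means over micro-cluster centers (illustrative only)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(n_iter):
        # assign each point to its nearest center
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centers[j])))
            groups[j].append(p)
        # recompute each center as the mean of its assigned points,
        # keeping the old center if a group happens to be empty
        centers = [
            [sum(c) / len(g) for c in zip(*g)] if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers
```

Because micro-clusters summarize many points, this offline step is cheap: it runs over at most `max_micro_clusters` centers rather than the full stream.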
Attributes¶
- **centers** (*dict*)
  Central positions of each cluster.
Examples¶
In the following example, `max_micro_clusters` and `time_window` are set relatively low due to the limited number of training points. Moreover, all points are learnt before any predictions are made. The `halflife` is set to 0.4 to show that you can pass `cluster.KMeans` parameters via keyword arguments.
>>> from river import cluster
>>> from river import stream
>>> X = [
... [1, 2],
... [1, 4],
... [1, 0],
... [4, 2],
... [4, 4],
... [4, 0]
... ]
>>> clustream = cluster.CluStream(time_window=1,
... max_micro_clusters=3,
... n_macro_clusters=2,
... seed=0,
... halflife=0.4)
>>> for i, (x, _) in enumerate(stream.iter_array(X)):
... clustream = clustream.learn_one(x)
>>> clustream.predict_one({0: 1, 1: 1})
1
>>> clustream.predict_one({0: 4, 1: 3})
0
Methods¶

**clone**

Return a fresh estimator with the same parameters.

The clone has the same parameters but has not been updated with any data. This works by looking at the parameters from the class signature. Each parameter is either recursively cloned if it is a River class, or deep-copied via `copy.deepcopy` otherwise. If the calling object is stochastic (i.e. it accepts a `seed` parameter) and has not been seeded, then the clone will not be idempotent. Indeed, this method's purpose is simply to return a new instance with the same input parameters.

**learn_one**

Update the model with a set of features `x`.

*Parameters*

- **x** (*dict*)
- **sample_weight** (*int*) – defaults to `None`

*Returns*

`Clusterer`: self

**predict_one**

Predicts the cluster number for a set of features `x`.

*Parameters*

- **x** (*dict*)

*Returns*

`int`: A cluster number.
References¶
1. Aggarwal, C.C., Philip, S.Y., Han, J. and Wang, J., 2003. A framework for clustering evolving data streams. In Proceedings 2003 VLDB Conference (pp. 81-92). Morgan Kaufmann. ↩