CluStream¶
The CluStream algorithm [1] maintains statistical information about the data using micro-clusters. These micro-clusters are temporal extensions of cluster feature vectors. The micro-clusters are stored at snapshots in time following a pyramidal pattern. This pattern allows summary statistics to be recalled from different time horizons.
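As a rough illustration of the pyramidal pattern, a snapshot taken at time `t` can be assigned to frame orders according to divisibility by powers of a base. This is only a sketch: the function name `snapshot_orders`, the base `alpha=2`, and the `max_order` cap are illustrative assumptions, not part of this API.

```python
def snapshot_orders(t: int, alpha: int = 2, max_order: int = 10) -> list:
    """List the pyramidal frame orders i for which a snapshot at time t
    qualifies, i.e. those i with t divisible by alpha**i.

    In the original paper a snapshot is stored at the *highest* such order;
    listing all qualifying orders here makes the pattern easier to see.
    """
    return [i for i in range(max_order + 1) if t % alpha**i == 0]

# Snapshots at times divisible by higher powers of alpha are kept longer,
# so recent history is stored at fine granularity and older history coarsely.
```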
Training with a new point `p` is performed in two main tasks:

- Determine the closest micro-cluster to `p`.
- Check whether `p` fits (memory) into the closest micro-cluster:
    - If `p` fits, put it into the micro-cluster.
    - If `p` does not fit, free some space to insert a new micro-cluster. This is done in one of two ways: delete an old micro-cluster, or merge the two micro-clusters closest to each other.
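The two training tasks above can be sketched in plain Python. This is a simplified, illustrative model, not the actual implementation: the names `MicroCluster` and `learn_one`, the merge-only space-freeing strategy, and the omission of timestamps are all assumptions made for brevity.

```python
import math

class MicroCluster:
    """Cluster feature vector: count n, linear sum ls, squared sum ss."""
    def __init__(self, x):
        self.n = 1
        self.ls = list(x)
        self.ss = [v * v for v in x]

    def center(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        # RMS deviation of the cluster's points from its centroid
        var = sum(ss / self.n - (ls / self.n) ** 2
                  for ls, ss in zip(self.ls, self.ss))
        return math.sqrt(max(var, 0.0))

    def insert(self, x):
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v

    def merge(self, other):
        # CF vectors are additive, so merging is component-wise addition
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]

def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def learn_one(micro_clusters, x, r_factor=2.0, max_micro_clusters=100):
    """One simplified CluStream training step (no timestamps)."""
    closest = min(micro_clusters, key=lambda mc: dist(mc.center(), x))
    # note: a singleton micro-cluster has zero radius, so only an exact
    # duplicate fits; the real algorithm uses a heuristic boundary there
    if dist(closest.center(), x) <= r_factor * closest.radius():
        closest.insert(x)  # p fits: absorb it into the closest micro-cluster
        return
    if len(micro_clusters) >= max_micro_clusters:
        # free space by merging the two closest micro-clusters
        # (alternatively, the algorithm may delete an outdated one)
        i, j = min(
            ((i, j) for i in range(len(micro_clusters))
                    for j in range(i + 1, len(micro_clusters))),
            key=lambda p: dist(micro_clusters[p[0]].center(),
                               micro_clusters[p[1]].center()),
        )
        micro_clusters[i].merge(micro_clusters[j])
        del micro_clusters[j]
    micro_clusters.append(MicroCluster(x))  # start a new micro-cluster at p
```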
Parameters¶

- **seed** (*int*) – defaults to `None`
  Random seed used for generating initial centroid positions.
- **time_window** (*int*) – defaults to `1000`
  If the current time is `T` and the time window is `h`, we only consider the data that arrived within the period `(T - h, T)`.
- **max_micro_clusters** (*int*) – defaults to `100`
  The maximum number of micro-clusters to use.
- **micro_cluster_r_factor** (*int*) – defaults to `2`
  Multiplier for the micro-cluster radius. When deciding whether to add a new data point to a micro-cluster, the maximum boundary is defined as `micro_cluster_r_factor` times the RMS deviation of the micro-cluster's data points from the centroid.
- **n_macro_clusters** (*int*) – defaults to `5`
  The number of clusters (`k`) for the k-means algorithm.
- **kwargs**
  Other parameters passed to the incremental k-means in `cluster.KMeans`.
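To form the final `n_macro_clusters` macro clusters, k-means is run over the micro-cluster centers. The following self-contained sketch uses plain Lloyd's k-means for illustration only; river's actual `cluster.KMeans` is incremental, and the function below is an assumed stand-in, not its API.

```python
import random

def kmeans(points, k, n_iter=10, seed=0):
    """Plain Lloyd's k-means over micro-cluster centers (illustrative only)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(n_iter):
        # assign each point to its nearest center
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centers[j])))
            groups[j].append(p)
        # recompute each center as the mean of its assigned points,
        # keeping the old center if a group happens to be empty
        centers = [
            [sum(c) / len(g) for c in zip(*g)] if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers
```

Because micro-clusters summarize many points, this offline step is cheap: it runs over at most `max_micro_clusters` centers rather than the full stream.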
Attributes¶
- **centers** (*dict*)
  Central positions of each cluster.
Examples¶
In the following example, `max_micro_clusters` and `time_window` are set relatively low due to the limited number of training points. Moreover, all points are learnt before any predictions are made. The `halflife` is set to 0.4 to show that you can pass `cluster.KMeans` parameters via keyword arguments.
>>> from river import cluster
>>> from river import stream
>>> X = [
... [1, 2],
... [1, 4],
... [1, 0],
... [4, 2],
... [4, 4],
... [4, 0]
... ]
>>> clustream = cluster.CluStream(time_window=1,
... max_micro_clusters=3,
... n_macro_clusters=2,
... seed=0,
... halflife=0.4)
>>> for i, (x, _) in enumerate(stream.iter_array(X)):
... clustream = clustream.learn_one(x)
>>> clustream.predict_one({0: 1, 1: 1})
1
>>> clustream.predict_one({0: 4, 1: 3})
0
Methods¶

**clone**

Return a fresh estimator with the same parameters.

The clone has the same parameters but has not been updated with any data. This works by looking at the parameters from the class signature. Each parameter is either recursively cloned if it is a River class, or deep-copied via `copy.deepcopy` otherwise. If the calling object is stochastic (i.e. it accepts a `seed` parameter) and has not been seeded, then the clone will not be idempotent. Indeed, this method's purpose is simply to return a new instance with the same input parameters.

**learn_one**

Update the model with a set of features `x`.

*Parameters*

- **x** (*dict*)
- **sample_weight** (*int*) – defaults to `None`

*Returns*

`Clusterer`: self

**predict_one**

Predicts the cluster number for a set of features `x`.

*Parameters*

- **x** (*dict*)

*Returns*

`int`: A cluster number.
References¶
1. Aggarwal, C.C., Philip, S.Y., Han, J. and Wang, J., 2003. A framework for clustering evolving data streams. In Proceedings 2003 VLDB Conference (pp. 81-92). Morgan Kaufmann. ↩