CluStream¶
The CluStream algorithm [1] maintains statistical information about the data using micro-clusters. These micro-clusters are temporal extensions of cluster feature vectors. The micro-clusters are stored at snapshots in time following a pyramidal pattern. This pattern makes it possible to recall summary statistics from different time horizons.
Training with a new point p is performed in two main tasks (a simplified sketch follows this list):

- Determine the closest micro-cluster to p.
- Check whether p fits (memory-wise) into that closest micro-cluster:
    - if p fits, add it to the micro-cluster;
    - if p does not fit, free some space so that a new micro-cluster can be inserted. This is done in one of two ways: delete an old micro-cluster, or merge the two micro-clusters closest to each other.
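To make the two tasks concrete, here is a minimal sketch in plain Python. It is not River's implementation: the MicroCluster class, the euclidean helper, and the deletion rule are simplified assumptions, and the pyramidal snapshot bookkeeping is omitted entirely.

```python
import math


# A minimal, hypothetical micro-cluster used only for this sketch; it is not
# River's internal representation (which also tracks timestamps for the
# pyramidal snapshots).
class MicroCluster:
    def __init__(self, x, timestamp):
        self.n = 1
        self.linear_sum = dict(x)                             # per-feature sum
        self.squared_sum = {k: v * v for k, v in x.items()}   # per-feature sum of squares
        self.last_update = timestamp

    def center(self):
        return {k: v / self.n for k, v in self.linear_sum.items()}

    def radius(self):
        # RMS deviation of the cluster's points from its centroid.
        variance = sum(
            self.squared_sum[k] / self.n - (self.linear_sum[k] / self.n) ** 2
            for k in self.linear_sum
        )
        return math.sqrt(max(variance, 0.0))

    def insert(self, x, timestamp):
        self.n += 1
        for k, v in x.items():
            self.linear_sum[k] += v
            self.squared_sum[k] += v * v
        self.last_update = timestamp


def euclidean(a, b):
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in a))


def train_one(micro_clusters, x, timestamp, r_factor=2, max_micro_clusters=100):
    """Sketch of the two training tasks; `micro_clusters` is a non-empty list."""
    # Task 1: determine the closest micro-cluster to x.
    closest = min(micro_clusters, key=lambda mc: euclidean(mc.center(), x))

    # Task 2: check whether x fits, i.e. lies within the maximum boundary
    # (r_factor times the RMS deviation of the cluster's points).
    if euclidean(closest.center(), x) <= r_factor * closest.radius():
        closest.insert(x, timestamp)
        return

    # x does not fit: free some space before inserting a new micro-cluster.
    if len(micro_clusters) >= max_micro_clusters:
        # Simplification: delete the least recently updated micro-cluster.
        # (The alternative is to merge the two closest micro-clusters.)
        oldest = min(micro_clusters, key=lambda mc: mc.last_update)
        micro_clusters.remove(oldest)
    micro_clusters.append(MicroCluster(x, timestamp))
```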
Parameters¶
- seed (int) – defaults to None
  Random seed used for generating initial centroid positions.
- time_window (int) – defaults to 1000
  If the current time is T and the time window is h, we only consider the data that arrived within the period (T - h, T).
- max_micro_clusters (int) – defaults to 100
  The maximum number of micro-clusters to use.
- micro_cluster_r_factor (int) – defaults to 2
  Multiplier for the micro-cluster radius. When deciding whether to add a new data point to a micro-cluster, the maximum boundary is defined as micro_cluster_r_factor times the RMS deviation of the micro-cluster's points from the centroid.
- n_macro_clusters (int) – defaults to 5
  The number of clusters (k) for the k-means algorithm.
- kwargs
  Other parameters passed to the incremental k-means at cluster.KMeans.
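As a configuration illustration (the specific values below are arbitrary, not recommendations), any keyword argument that is not one of the parameters above is forwarded to the underlying incremental cluster.KMeans, as the Examples section also demonstrates with halflife:

```python
from river import cluster

clustream = cluster.CluStream(
    seed=42,                  # reproducible initial centroid positions
    time_window=500,          # only consider data from the period (T - 500, T)
    max_micro_clusters=50,
    micro_cluster_r_factor=2,
    n_macro_clusters=3,
    halflife=0.5,             # not a CluStream parameter: passed on to cluster.KMeans
)
```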
Attributes¶
- centers (dict)
  Central positions of each cluster.
Examples¶
In the following example, max_micro_clusters and time_window are set relatively low due to the limited number of training points. Moreover, all points are learnt before any predictions are made. The halflife is set to 0.4 to show that you can pass cluster.KMeans parameters via keyword arguments.
>>> from river import cluster
>>> from river import stream
>>> X = [
... [1, 2],
... [1, 4],
... [1, 0],
... [4, 2],
... [4, 4],
... [4, 0]
... ]
>>> clustream = cluster.CluStream(time_window=1,
... max_micro_clusters=3,
... n_macro_clusters=2,
... seed=0,
... halflife=0.4)
>>> for i, (x, _) in enumerate(stream.iter_array(X)):
... clustream = clustream.learn_one(x)
>>> clustream.predict_one({0: 1, 1: 1})
1
>>> clustream.predict_one({0: 4, 1: 3})
0
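As a hedged continuation (not part of the original example), learning and predicting can also be interleaved in a streaming fashion, and the centers attribute documented above can be inspected afterwards; its exact contents depend on the data seen so far.

```python
# Hypothetical continuation of the example above; outputs are not shown
# because they depend on the data stream.
for x, _ in stream.iter_array([[0, 1], [5, 2]]):
    y_pred = clustream.predict_one(x)     # predict first ...
    clustream = clustream.learn_one(x)    # ... then learn (prequential style)

print(clustream.centers)  # central positions of each macro-cluster
```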
Methods¶
clone
Return a fresh estimator with the same parameters.
The clone has the same parameters but has not been updated with any data. This works by looking at the parameters from the class signature. Each parameter is either recursively cloned if it is a River class, or deep-copied via copy.deepcopy if not. If the calling object is stochastic (i.e. it accepts a seed parameter) and has not been seeded, then the clone will not be idempotent. Indeed, this method's purpose is simply to return a new instance with the same input parameters.
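For example (an illustration added here, not taken from the original docs), cloning the trained estimator from the Examples section would give an untrained copy with the same constructor parameters:

```python
fresh = clustream.clone()  # same parameters as `clustream`, but no data seen yet
```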
learn_one
Update the model with a set of features x.
Parameters
- x (dict)
- sample_weight (int) – defaults to None
Returns
Clusterer: self
predict_one
Predicts the cluster number for a set of features x.
Parameters
- x (dict)
Returns
int: A cluster number.
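A brief sketch of the two signatures documented above; how sample_weight affects the update is not described here, so it is only shown being passed explicitly with its default value:

```python
x = {0: 2, 1: 3}
clustream = clustream.learn_one(x, sample_weight=None)  # returns self
cluster_id = clustream.predict_one(x)                   # an int cluster number
```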
References¶
1. Aggarwal, C.C., Philip, S.Y., Han, J. and Wang, J., 2003. A framework for clustering evolving data streams. In Proceedings of the 2003 VLDB Conference (pp. 81-92). Morgan Kaufmann.