DenStream¶
DenStream
DenStream 1 is a clustering algorithm for evolving data streams. DenStream can discover clusters with arbitrary shape and is robust against noise (outliers).
"Dense" micro-clusters (named core-micro-clusters) summarise the clusters of arbitrary shape. A pruning strategy based on the concepts of potential and outlier micro-clusters guarantees the precision of the weights of the micro-clusters with limited memory.
The algorithm is divided into two parts:
Online micro-cluster maintenance (learning)
For a new point p
:
-
Try to merge
p
into either the nearestp-micro-cluster
(potential),o-micro-cluster
(outlier), or create a newo-micro-cluster
and insert it into the outlier buffer. -
For each
T_p
iterations, consider the weights of all potential and outlier micro-clusters. If their weights are smaller than a certain threshold (different for each type of micro-clusters), the micro-cluster is deleted.
Offline generation of clusters on-demand (clustering)
A variant of the DBSCAN algorithm 2 is used, such that all density-connected p-micro-clusters determine the final clusters.
Parameters¶
-
decaying_factor (float) – defaults to
0.25
Parameter that controls the importance of historical data to current cluster. Note that
decaying_factor
has to be different from0
. -
core_weight_threshold (float) – defaults to
5
Parameter to determine the threshold of outlier relative to core micro-clusters. Note that
core_weight_threshold * tolerance_factor
has to be greater than1
or less than0
. -
tolerance_factor (float) – defaults to
0.5
Parameter to determine the threshold of outliers relative to core micro-cluster. In a normal setting, this parameter is usuallly set within the range
[0,1]
. Once again, note thatcore_weight_threshold * tolerance_factor
has to be greater than1
or less than0
. -
radius (float) – defaults to
2
This parameter is passed onto the DBSCAN offline algorithm as the \(\epsilon\) parameter when a clustering request arrives.
Attributes¶
-
n_clusters
Number of clusters generated by the algorithm.
-
clusters
A set of final clusters of type
MicroCluster
, which means that these cluster include all the required information, including number of points, creation time, weight, (weighted) linear sum, (weighted) square sum, center and radius. -
p_micro_clusters
The p micro-clusters that are generated by the algorithm. When a generating cluster request arrives, these p-micro-clusters will go through a variant of DBSCAN algorithm to determine the final clusters.
-
o_micro_clusters
The outlier buffer, separating the processing of the potential core-micro-cluster and outlier-micro-clusters.
Examples¶
The following example uses the default parameters of the algorithm to test its functionality. It can easily be seen
that the set of evolving points X
are designed so that there can be a clear picture drawn on how the clusters can
be generated.
>>> from river import cluster
>>> from river import stream
>>> X = [
... [-1, -0.5], [-1, -0.625], [-1, -0.75], [-1, -1], [-1, -1.125], [-1, -1.25],
... [-1.5, -0.5], [-1.5, -0.625], [-1.5, -0.75], [-1.5, -1], [-1.5, -1.125], [-1.5, -1.25],
... [1, 1.5], [1, 1.75], [1, 2], [4, 1.25], [4, 1.5], [4, 2.25],
... [4, 2.5], [4, 3], [4, 3.25], [4, 3.5], [4, 3.75], [4, 4],
... ]
>>> denstream = cluster.DenStream(decaying_factor = 0.01,
... core_weight_threshold = 1.01,
... tolerance_factor = 1.0005,
... radius = 0.5)
>>> for x, _ in stream.iter_array(X):
... denstream = denstream.learn_one(x)
>>> denstream.predict_one({0: -1, 1: -2})
0
>>> denstream.predict_one({0:5, 1:4})
1
>>> denstream.n_clusters
2
Methods¶
clone
Return a fresh estimator with the same parameters.
The clone has the same parameters but has not been updated with any data. This works by looking at the parameters from the class signature. Each parameter is either - recursively cloned if it's a River classes. - deep-copied via copy.deepcopy
if not. If the calling object is stochastic (i.e. it accepts a seed parameter) and has not been seeded, then the clone will not be idempotent. Indeed, this method's purpose if simply to return a new instance with the same input parameters.
learn_one
Update the model with a set of features x
.
Parameters
- x (dict)
- sample_weight (int) – defaults to
None
Returns
Clusterer: self
predict_one
Predicts the cluster number for a set of features x
.
Parameters
- x (dict)
- sample_weight – defaults to
None
Returns
int: A cluster number.
References¶
-
Feng et al (2006, pp 328-339). Density-Based Clustering over an Evolving Data Stream with Noise. In Proceedings of the Sixth SIAM International Conference on Data Mining, April 20–22, 2006, Bethesda, MD, USA. ↩
-
Ester et al (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In KDD-96 Proceedings, AAAI. ↩