DenStream¶
DenStream
DenStream 1 is a clustering algorithm for evolving data streams. DenStream can discover clusters with arbitrary shape and is robust against noise (outliers).
"Dense" micro-clusters (named core-micro-clusters) summarise the clusters of arbitrary shape. A pruning strategy based on the concepts of potential and outlier micro-clusters guarantees the precision of the weights of the micro-clusters with limited memory.
The algorithm is divided into two parts:
Online micro-cluster maintenance (learning)
For a new point p
:
-
Try to merge
p
into either the nearestp-micro-cluster
(potential),o-micro-cluster
(outlier), or create a newo-micro-cluster
and insert it into the outlier buffer. -
For each
T_p
iterations, consider the weights of all potential and outlier micro-clusters. If their weights are smaller than a certain threshold (different for each type of micro-clusters), the micro-cluster is deleted.
Offline generation of clusters on-demand (clustering)
A variant of the DBSCAN algorithm 2 is used, such that all density-connected p-micro-clusters determine the final clusters. Moreover, in order for the algorithm to always be able to generate clusters, a certain number of points must be passed through the algorithm with a suitable streaming speed (number of points passed through within a unit time), indicated by n_samples_init
and stream_speed
.
Parameters¶
-
decaying_factor (float) – defaults to
0.25
Parameter that controls the importance of historical data to current cluster. Note that
decaying_factor
has to be different from0
. -
beta (float) – defaults to
0.75
Parameter to determine the threshold of outlier relative to core micro-clusters. The value of
beta
must be within the range(0,1]
. -
mu (float) – defaults to
2
Parameter to determine the threshold of outliers relative to core micro-cluster. As
beta * mu
must be greater than 1,mu
must be within the range(1/beta, inf)
. -
epsilon (float) – defaults to
0.02
Defines the epsilon neighborhood
-
n_samples_init (int) – defaults to
1000
Number of points to to initiqalize the online process
-
stream_speed (int) – defaults to
100
Number of points arrived in unit time
Attributes¶
-
n_clusters
Number of clusters generated by the algorithm.
-
clusters
A set of final clusters of type
MicroCluster
, which means that these cluster include all the required information, including number of points, creation time, weight, (weighted) linear sum, (weighted) square sum, center and radius. -
p_micro_clusters
The potential core-icro-clusters that are generated by the algorithm. When a generate cluster request arrives, these p-micro-clusters will go through a variant of the DBSCAN algorithm to determine the final clusters.
-
o_micro_clusters
The outlier micro-clusters.
Examples¶
The following example uses the default parameters of the algorithm to test its functionality.
The set of evolving points X
are designed so that clusters are easily identifiable.
>>> from river import cluster
>>> from river import stream
>>> X = [
... [-1, -0.5], [-1, -0.625], [-1, -0.75], [-1, -1], [-1, -1.125],
... [-1, -1.25], [-1.5, -0.5], [-1.5, -0.625], [-1.5, -0.75], [-1.5, -1],
... [-1.5, -1.125], [-1.5, -1.25], [1, 1.5], [1, 1.75], [1, 2],
... [4, 1.25], [4, 1.5], [4, 2.25], [4, 2.5], [4, 3],
... [4, 3.25], [4, 3.5], [4, 3.75], [4, 4],
... ]
>>> denstream = cluster.DenStream(decaying_factor=0.01,
... beta=0.5,
... mu=2.5,
... epsilon=0.5,
... n_samples_init=10)
>>> for x, _ in stream.iter_array(X):
... denstream = denstream.learn_one(x)
>>> denstream.predict_one({0: -1, 1: -2})
1
>>> denstream.predict_one({0:5, 1:4})
2
>>> denstream.predict_one({0:1, 1:1})
0
>>> denstream.n_clusters
3
Methods¶
BufferItem
learn_one
Update the model with a set of features x
.
Parameters
- x (dict)
- sample_weight – defaults to
None
Returns
Clusterer: self
predict_one
Predicts the cluster number for a set of features x
.
Parameters
- x (dict)
- sample_weight – defaults to
None
Returns
int: A cluster number.
References¶
-
Feng et al (2006, pp 328-339). Density-Based Clustering over an Evolving Data Stream with Noise. In Proceedings of the Sixth SIAM International Conference on Data Mining, April 20–22, 2006, Bethesda, MD, USA. ↩
-
Ester et al (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In KDD-96 Proceedings, AAAI. ↩