DenStream¶
DenStream
DenStream 1 is a clustering algorithm for evolving data streams. DenStream can discover clusters with arbitrary shape and is robust against noise (outliers).
"Dense" micro-clusters (named core-micro-clusters) summarise the clusters of arbitrary shape. A pruning strategy based on the concepts of potential and outlier micro-clusters guarantees the precision of the weights of the micro-clusters with limited memory.
The algorithm is divided into two parts:
Online micro-cluster maintenance (learning)
For a new point p
:
-
Try to merge
p
into either the nearestp-micro-cluster
(potential),o-micro-cluster
(outlier), or create a newo-micro-cluster
and insert it into the outlier buffer. -
For each
T_p
iterations, consider the weights of all potential and outlier micro-clusters. If their weights are smaller than a certain threshold (different for each type of micro-clusters), the micro-cluster is deleted.
Offline generation of clusters on-demand (clustering)
A variant of the DBSCAN algorithm 2 is used, such that all density-connected p-micro-clusters determine the final clusters. Moreover, in order for the algorithm to always be able to generate clusters, a certain number of points must be passed through the algorithm with a suitable streaming speed (number of points passed through within a unit time), indicated by n_samples_init
and stream_speed
.
Parameters¶
-
decaying_factor
Type → float
Default →
0.25
Parameter that controls the importance of historical data to current cluster. Note that
decaying_factor
has to be different from0
. -
beta
Type → float
Default →
0.75
Parameter to determine the threshold of outlier relative to core micro-clusters. The value of
beta
must be within the range(0,1]
. -
mu
Type → float
Default →
2
Parameter to determine the threshold of outliers relative to core micro-cluster. As
beta * mu
must be greater than 1,mu
must be within the range(1/beta, inf)
. -
epsilon
Type → float
Default →
0.02
Defines the epsilon neighborhood
-
n_samples_init
Type → int
Default →
1000
Number of points to to initiqalize the online process
-
stream_speed
Type → int
Default →
100
Number of points arrived in unit time
Attributes¶
-
n_clusters
Number of clusters generated by the algorithm.
-
clusters
A set of final clusters of type
MicroCluster
, which means that these cluster include all the required information, including number of points, creation time, weight, (weighted) linear sum, (weighted) square sum, center and radius. -
p_micro_clusters
The potential core-icro-clusters that are generated by the algorithm. When a generate cluster request arrives, these p-micro-clusters will go through a variant of the DBSCAN algorithm to determine the final clusters.
-
o_micro_clusters
The outlier micro-clusters.
Examples¶
The following example uses the default parameters of the algorithm to test its functionality.
The set of evolving points X
are designed so that clusters are easily identifiable.
from river import cluster
from river import stream
X = [
[-1, -0.5], [-1, -0.625], [-1, -0.75], [-1, -1], [-1, -1.125],
[-1, -1.25], [-1.5, -0.5], [-1.5, -0.625], [-1.5, -0.75], [-1.5, -1],
[-1.5, -1.125], [-1.5, -1.25], [1, 1.5], [1, 1.75], [1, 2],
[4, 1.25], [4, 1.5], [4, 2.25], [4, 2.5], [4, 3],
[4, 3.25], [4, 3.5], [4, 3.75], [4, 4],
]
denstream = cluster.DenStream(decaying_factor=0.01,
beta=0.5,
mu=2.5,
epsilon=0.5,
n_samples_init=10)
for x, _ in stream.iter_array(X):
denstream.learn_one(x)
denstream.predict_one({0: -1, 1: -2})
0
denstream.predict_one({0: 5, 1: 4})
1
denstream.predict_one({0: 1, 1: 1})
0
denstream.n_clusters
2
Methods¶
BufferItem
learn_one
Update the model with a set of features x
.
Parameters
- x — 'dict'
- w — defaults to
None
predict_one
Predicts the cluster number for a set of features x
.
Parameters
- x — 'dict'
- w — defaults to
None
Returns
int: A cluster number.
-
Feng et al (2006, pp 328-339). Density-Based Clustering over an Evolving Data Stream with Noise. In Proceedings of the Sixth SIAM International Conference on Data Mining, April 20–22, 2006, Bethesda, MD, USA. ↩
-
Ester et al (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In KDD-96 Proceedings, AAAI. ↩