DenStream¶
DenStream
DenStream ^{1} is a clustering algorithm for evolving data streams. DenStream can discover clusters with arbitrary shape and is robust against noise (outliers).
"Dense" microclusters (named coremicroclusters) summarise the clusters of arbitrary shape. A pruning strategy based on the concepts of potential and outlier microclusters guarantees the precision of the weights of the microclusters with limited memory.
The algorithm is divided into two parts:
Online microcluster maintenance (learning)
For a new point p
:

Try to merge
p
into either the nearestpmicrocluster
(potential),omicrocluster
(outlier), or create a newomicrocluster
and insert it into the outlier buffer. 
For each
T_p
iterations, consider the weights of all potential and outlier microclusters. If their weights are smaller than a certain threshold (different for each type of microclusters), the microcluster is deleted.
Offline generation of clusters ondemand (clustering)
A variant of the DBSCAN algorithm ^{2} is used, such that all densityconnected pmicroclusters determine the final clusters. Moreover, in order for the algorithm to always be able to generate clusters, a certain number of points must be passed through the algorithm with a suitable streaming speed (number of points passed through within a unit time), indicated by n_samples_init
and stream_speed
.
Parameters¶

decaying_factor
Type → float
Default →
0.25
Parameter that controls the importance of historical data to current cluster. Note that
decaying_factor
has to be different from0
. 
beta
Type → float
Default →
0.75
Parameter to determine the threshold of outlier relative to core microclusters. The value of
beta
must be within the range(0,1]
. 
mu
Type → float
Default →
2
Parameter to determine the threshold of outliers relative to core microcluster. As
beta * mu
must be greater than 1,mu
must be within the range(1/beta, inf)
. 
epsilon
Type → float
Default →
0.02
Defines the epsilon neighborhood

n_samples_init
Type → int
Default →
1000
Number of points to to initiqalize the online process

stream_speed
Type → int
Default →
100
Number of points arrived in unit time
Attributes¶

n_clusters
Number of clusters generated by the algorithm.

clusters
A set of final clusters of type
MicroCluster
, which means that these cluster include all the required information, including number of points, creation time, weight, (weighted) linear sum, (weighted) square sum, center and radius. 
p_micro_clusters
The potential coreicroclusters that are generated by the algorithm. When a generate cluster request arrives, these pmicroclusters will go through a variant of the DBSCAN algorithm to determine the final clusters.

o_micro_clusters
The outlier microclusters.
Examples¶
The following example uses the default parameters of the algorithm to test its functionality.
The set of evolving points X
are designed so that clusters are easily identifiable.
from river import cluster
from river import stream
X = [
[1, 0.5], [1, 0.625], [1, 0.75], [1, 1], [1, 1.125],
[1, 1.25], [1.5, 0.5], [1.5, 0.625], [1.5, 0.75], [1.5, 1],
[1.5, 1.125], [1.5, 1.25], [1, 1.5], [1, 1.75], [1, 2],
[4, 1.25], [4, 1.5], [4, 2.25], [4, 2.5], [4, 3],
[4, 3.25], [4, 3.5], [4, 3.75], [4, 4],
]
denstream = cluster.DenStream(decaying_factor=0.01,
beta=0.5,
mu=2.5,
epsilon=0.5,
n_samples_init=10)
for x, _ in stream.iter_array(X):
denstream.learn_one(x)
denstream.predict_one({0: 1, 1: 2})
1
denstream.predict_one({0: 5, 1: 4})
2
denstream.predict_one({0: 1, 1: 1})
0
denstream.n_clusters
3
Methods¶
BufferItem
learn_one
Update the model with a set of features x
.
Parameters
 x — 'dict'
 w — defaults to
None
predict_one
Predicts the cluster number for a set of features x
.
Parameters
 x — 'dict'
 w — defaults to
None
Returns
int: A cluster number.

Feng et al (2006, pp 328339). DensityBased Clustering over an Evolving Data Stream with Noise. In Proceedings of the Sixth SIAM International Conference on Data Mining, April 20–22, 2006, Bethesda, MD, USA. ↩

Ester et al (1996). A DensityBased Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In KDD96 Proceedings, AAAI. ↩