DenStream¶

DenStream

DenStream ¹ is a clustering algorithm for evolving data streams. DenStream can discover clusters with arbitrary shape and is robust against noise (outliers).

"Dense" micro-clusters (named core-micro-clusters) summarise the clusters of arbitrary shape. A pruning strategy based on the concepts of potential and outlier micro-clusters guarantees the precision of the weights of the micro-clusters with limited memory.

The algorithm is divided into two parts:

Online micro-cluster maintenance (learning)

For a new point p:

Try to merge p into either the nearest p-micro-cluster (potential), o-micro-cluster (outlier), or create a new o-micro-cluster and insert it into the outlier buffer.
For each T_p iterations, consider the weights of all potential and outlier micro-clusters. If their weights are smaller than a certain threshold (different for each type of micro-clusters), the micro-cluster is deleted.

Offline generation of clusters on-demand (clustering)

A variant of the DBSCAN algorithm ² is used, such that all density-connected p-micro-clusters determine the final clusters. Moreover, in order for the algorithm to always be able to generate clusters, a certain number of points must be passed through the algorithm with a suitable streaming speed (number of points passed through within a unit time), indicated by n_samples_init and stream_speed.

Parameters¶

decaying_factor (float) – defaults to 0.25

Parameter that controls the importance of historical data to current cluster. Note that decaying_factor has to be different from 0.
beta (float) – defaults to 0.75

Parameter to determine the threshold of outlier relative to core micro-clusters. The value of beta must be within the range (0,1].
mu (float) – defaults to 2

Parameter to determine the threshold of outliers relative to core micro-cluster. As beta * mu must be greater than 1, mu must be within the range (1/beta, inf).
epsilon (float) – defaults to 0.02

Defines the epsilon neighborhood
n_samples_init (int) – defaults to 1000

Number of points to to initiqalize the online process
stream_speed (int) – defaults to 100

Number of points arrived in unit time

Attributes¶

n_clusters

Number of clusters generated by the algorithm.
clusters

A set of final clusters of type MicroCluster, which means that these cluster include all the required information, including number of points, creation time, weight, (weighted) linear sum, (weighted) square sum, center and radius.
p_micro_clusters

The potential core-icro-clusters that are generated by the algorithm. When a generate cluster request arrives, these p-micro-clusters will go through a variant of the DBSCAN algorithm to determine the final clusters.
o_micro_clusters

The outlier micro-clusters.

Examples¶

The following example uses the default parameters of the algorithm to test its functionality. The set of evolving points X are designed so that clusters are easily identifiable.

>>> from river import cluster
>>> from river import stream

>>> X = [
...     [-1, -0.5], [-1, -0.625], [-1, -0.75], [-1, -1], [-1, -1.125],
...     [-1, -1.25], [-1.5, -0.5], [-1.5, -0.625], [-1.5, -0.75], [-1.5, -1],
...     [-1.5, -1.125], [-1.5, -1.25], [1, 1.5], [1, 1.75], [1, 2],
...     [4, 1.25], [4, 1.5], [4, 2.25], [4, 2.5], [4, 3],
...     [4, 3.25], [4, 3.5], [4, 3.75], [4, 4],
... ]

>>> denstream = cluster.DenStream(decaying_factor=0.01,
...                               beta=0.5,
...                               mu=2.5,
...                               epsilon=0.5,
...                               n_samples_init=10)

>>> for x, _ in stream.iter_array(X):
...     denstream = denstream.learn_one(x)

>>> denstream.predict_one({0: -1, 1: -2})
1

>>> denstream.predict_one({0:5, 1:4})
2

>>> denstream.predict_one({0:1, 1:1})
0

>>> denstream.n_clusters
3

Methods¶

BufferItem

learn_one

Update the model with a set of features x.

Parameters

x (dict)
sample_weight – defaults to None

Returns

Clusterer: self

predict_one

Predicts the cluster number for a set of features x.

Parameters

x (dict)
sample_weight – defaults to None

Returns

int: A cluster number.

References¶

Feng et al (2006, pp 328-339). Density-Based Clustering over an Evolving Data Stream with Noise. In Proceedings of the Sixth SIAM International Conference on Data Mining, April 20–22, 2006, Bethesda, MD, USA. ↩
Ester et al (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In KDD-96 Proceedings, AAAI. ↩