DBSTREAM¶
DBSTREAM 1 is a clustering algorithm for evolving data streams. It is the first micro-cluster-based online clustering component that explicitly captures the density between micro clusters via a shared density graph. The density information in this graph is then exploited for reclustering based on the actual density between adjacent micro clusters.
The algorithm is divided into two parts:
Online micro-cluster maintenance (learning)
For a new point p:

- Find all micro clusters for which p falls within the fixed radius (clustering threshold).
- If no neighbor is found, a new micro cluster with a weight of 1 is created for p.
- If one or more neighbors of p are found, the micro clusters are updated by applying the appropriate fading and increasing their weights, and their centers are then moved closer to p using the Gaussian neighborhood function.
- Next, the shared density graph is updated. To prevent micro clusters from collapsing, their movement is restricted whenever two of them come closer than \(r\) (clustering threshold) to each other. At the end of this step, the time stamp is increased by 1.
- Finally, a cleanup is performed. It is executed every t_gap time steps, removing weak micro clusters and weak entries in the shared density graph to recover memory and improve the algorithm's processing speed.
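The online update above can be sketched in plain Python. This is a simplified, illustrative version, not River's internal implementation; the function name `update`, the dictionary layout of a micro cluster, and the exact shape of the Gaussian neighborhood function are assumptions for the sake of the sketch.

```python
import math

def update(micro_clusters, p, r=1.0, fading_factor=0.01, t=0):
    """One simplified online step: micro_clusters is a list of dicts
    with 'center', 'weight' and 'last_update' keys; p is the new point."""
    # find all micro clusters whose leader lies within the radius r of p
    neighbors = [
        mc for mc in micro_clusters
        if math.dist(mc["center"], p) <= r
    ]
    if not neighbors:
        # no neighbor: create a new micro cluster with weight 1 for p
        micro_clusters.append({"center": list(p), "weight": 1.0, "last_update": t})
        return micro_clusters
    for mc in neighbors:
        d = math.dist(mc["center"], p)
        # apply fading to the old weight before increasing it
        mc["weight"] = mc["weight"] * 2 ** (-fading_factor * (t - mc["last_update"])) + 1
        mc["last_update"] = t
        # Gaussian neighborhood function: move the center towards p,
        # with a pull that decreases with the distance d
        h = math.exp(-(d ** 2) / (2 * (r / 3) ** 2))
        mc["center"] = [c + h * (pi - c) for c, pi in zip(mc["center"], p)]
    return micro_clusters
```

The shared-density bookkeeping and the movement restriction are omitted here; the sketch only covers the neighbor search, creation, fading, and center update.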
Offline generation of macro clusters (clustering)
The macro clusters are generated offline in the two following steps:

- The connectivity graph C is constructed using the shared density entries between strong micro clusters. The edges of this graph whose connectivity value is greater than the intersection threshold (\(\alpha\)) are used to find the connected components representing the final clusters.
- After the connectivity graph is generated, a variant of the DBSCAN algorithm proposed by Ester et al. is applied to form all macro clusters from \(\alpha\)-connected micro clusters.
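The core of the offline step is a connected-components search over the shared density graph, keeping only edges whose connectivity exceeds \(\alpha\). The sketch below is a minimal illustration of that idea; the function name `macro_clusters` and the weight-based normalisation of the connectivity are assumptions, not River's exact formulation.

```python
def macro_clusters(n, shared_density, weights, alpha=0.3):
    """n: number of micro clusters; shared_density: dict {(i, j): density};
    weights: per-micro-cluster weights. Returns a component label per node."""
    # build an adjacency list restricted to edges with connectivity > alpha
    adj = {i: [] for i in range(n)}
    for (i, j), s in shared_density.items():
        # normalise the shared density by the mean weight of the two clusters
        conn = s / ((weights[i] + weights[j]) / 2)
        if conn > alpha:
            adj[i].append(j)
            adj[j].append(i)
    # label each connected component with a depth-first traversal
    labels = [-1] * n
    label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        stack = [start]
        while stack:
            node = stack.pop()
            if labels[node] == -1:
                labels[node] = label
                stack.extend(adj[node])
        label += 1
    return labels
```

Each resulting label corresponds to one macro cluster: a maximal set of \(\alpha\)-connected micro clusters.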
Parameters¶
- clustering_threshold (float) – defaults to 1.0
  DBSTREAM represents each micro cluster by a leader (a data point defining the micro cluster's center) and the density in an area of a user-specified radius \(r\) (clustering_threshold) around the center.
- fading_factor (float) – defaults to 0.01
  Parameter that controls the importance of historical data to the current cluster. Note that fading_factor has to be different from 0.
- cleanup_interval (float) – defaults to 2
  The time interval between two consecutive cleanup processes.
- intersection_factor (float) – defaults to 0.3
  The intersection factor, related to the area of the overlap of the micro clusters relative to the area covered by the micro clusters. This parameter is used to determine whether a micro cluster or a shared density entry is weak.
- minimum_weight (float) – defaults to 1.0
  The minimum weight for a cluster not to be considered noise.
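To make the role of fading_factor concrete, the sketch below shows the usual exponential fading scheme, under which a weight decays as \(w \cdot 2^{-\lambda \Delta t}\) between updates. The helper name `faded_weight` is illustrative; the decay base and the way cleanup uses it are assumptions consistent with the algorithm's description, not a copy of River's code.

```python
def faded_weight(weight, fading_factor, dt):
    """Exponentially fade a micro cluster's weight after dt time steps.
    A larger fading_factor makes historical data lose influence faster;
    fading_factor must be non-zero for fading to take effect."""
    return weight * 2 ** (-fading_factor * dt)
```

With the default fading_factor of 0.01, a micro cluster that receives no points loses half of its weight after 100 time steps; during cleanup, clusters whose faded weight has dropped below the weak threshold are removed.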
Attributes¶
- n_clusters
  Number of clusters generated by the algorithm.
- clusters
  A set of final clusters of type DBStreamMicroCluster. These are either micro clusters, or macro clusters generated by merging all \(\alpha\)-connected micro clusters. This set is generated during the offline phase of the algorithm.
- centers
  Final clusters' centers.
- micro_clusters
  Micro clusters generated by the algorithm. Rather than assigning each new instance directly to the nearest micro cluster, at each iteration the weights and centers are modified so that the clusters move closer to the new points, using the Gaussian neighborhood function.
Examples¶
>>> from river import cluster
>>> from river import stream
>>> X = [
... [1, 0.5], [1, 0.625], [1, 0.75], [1, 1.125], [1, 1.5], [1, 1.75],
... [4, 1.5], [4, 2.25], [4, 2.5], [4, 3], [4, 3.25], [4, 3.5]
... ]
>>> dbstream = cluster.DBSTREAM(clustering_threshold=1.5,
...                             fading_factor=0.05,
...                             cleanup_interval=4,
...                             intersection_factor=0.5,
...                             minimum_weight=1)
>>> for x, _ in stream.iter_array(X):
... dbstream = dbstream.learn_one(x)
>>> dbstream.predict_one({0: 1, 1: 2})
0
>>> dbstream.predict_one({0: 5, 1: 2})
1
>>> dbstream.n_clusters
2
Methods¶
clone
Return a fresh estimator with the same parameters.
The clone has the same parameters but has not been updated with any data. This works by looking at the parameters of the class signature. Each parameter is either:
- recursively cloned if it is a River class,
- deep-copied via copy.deepcopy otherwise.
If the calling object is stochastic (i.e. it accepts a seed parameter) and has not been seeded, then the clone will not be idempotent. Indeed, this method's purpose is simply to return a new instance with the same input parameters.
learn_one
Update the model with a set of features x.
Parameters
- x (dict)
- sample_weight (int) – defaults to
None
Returns
Clusterer: self
predict_one
Predict the cluster number for a set of features x.
Parameters
- x (dict)
- sample_weight – defaults to
None
Returns
int: A cluster number.
References¶
-
Michael Hahsler and Matthew Bolanos (2016). Clustering Data Streams Based on Shared Density between Micro-Clusters. IEEE Transactions on Knowledge and Data Engineering 28(6), pp. 1449-1461. ↩
-
Ester et al. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In KDD-96 Proceedings, AAAI. ↩