TextClust¶
textClust, a clustering algorithm for text data.
textClust (Assenmacher and Trautmann, 2022; Carnein et al., 2017) is a stream clustering algorithm for textual data that can identify and track topics over time in a stream of texts. The algorithm uses the widely used two-phase clustering approach, in which the stream is first summarised in real time.
The result is a set of many small preliminary clusters called micro-clusters. Micro-clusters maintain enough information to be updated incrementally and to efficiently calculate the cosine similarity between them over time, based on the TF-IDF vector of their texts. Upon request, the micro-clusters can be reclustered to generate the final result using any distance-based clustering algorithm, such as hierarchical clustering. To keep the micro-clusters up to date, the algorithm applies a fading strategy in which micro-clusters that are not updated regularly lose relevance and are eventually removed.
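The distance computation described above can be illustrated with a minimal sketch in plain Python. It is not the library's implementation: the helper names `tfidf_vector` and `cosine_distance`, the smoothed IDF formula, and the toy term counts are all assumptions chosen for illustration. Each micro-cluster is represented only by its term counts, which is enough to compute a TF-IDF cosine distance and compare it with a threshold such as `radius`.

```python
import math
from collections import Counter


def tfidf_vector(term_counts, doc_freq, n_docs):
    """Weight raw term counts by a smoothed inverse document frequency.

    The smoothing (log((1 + n) / (1 + df)) + 1) is an assumption made for
    this sketch, not necessarily the formula used by textClust.
    """
    return {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
        for term, count in term_counts.items()
    }


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty summary as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical term-count summaries of three micro-clusters.
mc_a = Counter({"first": 1, "document": 2})
mc_b = Counter({"super": 1, "unrelated": 1})
mc_c = Counter({"second": 1, "document": 1})
summaries = [mc_a, mc_b, mc_c]

# Document frequency: in how many summaries each term occurs.
doc_freq = Counter()
for summary in summaries:
    doc_freq.update(summary.keys())

vec_a, vec_b, vec_c = (
    tfidf_vector(s, doc_freq, len(summaries)) for s in summaries
)

radius = 0.3  # distance threshold, set here to the documented default
for name, other in (("b", vec_b), ("c", vec_c)):
    dist = cosine_distance(vec_a, other)
    print(f"a vs {name}: distance={dist:.3f}, within radius={dist <= radius}")
```

Micro-clusters that share no terms end up at distance 1, while summaries with overlapping vocabulary fall somewhere between 0 and 1 and are candidates for merging once the distance drops below the threshold.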
Parameters¶
-
radius – defaults to
0.3
Distance threshold to merge two micro-clusters. Must be within the range
(0, 1]
-
fading_factor – defaults to
0.0005
Fading factor of micro-clusters
-
tgap – defaults to
100
Time between outlier removal
-
term_fading – defaults to
True
Determines whether individual terms should also be faded
-
real_time_fading – defaults to
True
Parameter that specifies whether natural time or the number of observations should be used for fading
-
micro_distance – defaults to
tfidf_cosine_distance
Distance metric used for clustering micro-clusters
-
macro_distance – defaults to
tfidf_cosine_distance
Distance metric used for clustering macro-clusters
-
num_macro – defaults to
3
Number of macro clusters that should be identified during the reclustering phase
-
min_weight – defaults to
0
Minimum weight of micro-clusters to be used for reclustering
-
auto_r – defaults to
False
Parameter that specifies if
radius
should be automatically updated
-
auto_merge – defaults to
True
Determines whether close observations should be merged together
-
sigma – defaults to
1
Parameter that influences the automated threshold adaptation technique
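To get a feel for the fading-related parameters, the following sketch works through the arithmetic for the default values of `fading_factor` and `tgap`. It assumes an exponential decay of the form weight · 2^(−fading_factor · Δt), which is common in stream clustering; the exact formula used internally is not stated on this page, and the helper names `faded_weight` and `half_life` are hypothetical.

```python
def faded_weight(weight, fading_factor, elapsed):
    """Decay a weight over `elapsed` time units.

    Assumes the decay form weight * 2 ** (-fading_factor * elapsed);
    the library may use a different formula.
    """
    return weight * 2.0 ** (-fading_factor * elapsed)


def half_life(fading_factor):
    """Time units until a weight that is never updated has halved."""
    return 1.0 / fading_factor


fading_factor = 0.0005  # documented default
tgap = 100              # documented default: time between outlier removals

# How long an untouched micro-cluster takes to lose half of its weight.
print(f"half-life: {half_life(fading_factor):.0f} time units")

# Fraction of weight remaining after one cleanup interval without updates.
print(f"weight after tgap: {faded_weight(1.0, fading_factor, tgap):.4f}")
```

Under this assumption a larger `fading_factor` makes the model forget old topics faster, while `tgap` only controls how often faded micro-clusters are checked for removal.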
Attributes¶
-
micro_clusters
Micro-clusters generated by the algorithm. Micro-clusters are of type
textclust.microcluster
Examples¶
>>> from river import compose
>>> from river import feature_extraction
>>> from river import metrics
>>> from river import cluster
>>> corpus = [
... {"text":'This is the first document.',"idd":1,"cluster": 1},
... {"text":'This document is the second document.',"idd":2,"cluster": 1},
... {"text":'And this is super unrelated.',"idd":3,"cluster": 2},
... {"text":'Is this the first document?',"idd":4,"cluster": 1},
... {"text":'This is super unrelated as well',"idd":5,"cluster": 2},
... {"text":'Test text',"idd":6,"cluster": 5}
... ]
>>> stopwords = [ 'stop', 'the', 'to', 'and', 'a', 'in', 'it', 'is', 'I']
>>> metric = metrics.AdjustedRand()
>>> model = compose.Pipeline(
... feature_extraction.BagOfWords(lowercase=True, ngram_range=(1, 2), stop_words=stopwords),
... cluster.TextClust(real_time_fading=False, fading_factor=0.001, tgap=100, auto_r=True,
... radius=0.9)
... )
>>> for x in corpus:
...     y_pred = model.predict_one(x["text"])
...     y = x["cluster"]
...     metric.update(y, y_pred)
...     model.learn_one(x["text"])
>>> print(metric)
AdjustedRand: -0.17647058823529413
Methods¶
distances
get_assignment
get_macroclusters
learn_one
Update the model with a set of features x.
Parameters
- x (dict)
- t – defaults to
None
- sample_weight – defaults to
None
Returns
Clusterer: self
microcluster
predict_one
Predicts the cluster number for a set of features x.
Parameters
- x (dict)
- sample_weight – defaults to
None
- type – defaults to
micro
Returns
int: A cluster number.
showclusters
tfcontainer
updateMacroClusters
References¶
-
Assenmacher, D. and Trautmann, H. (2022). Textual One-Pass Stream Clustering with Automated Distance Threshold Adaption. In: Asian Conference on Intelligent Information and Database Systems (Accepted).
-
Carnein, M., Assenmacher, D., Trautmann, H. (2017). Stream Clustering of Chat Messages with Applications to Twitch Streams. In: Advances in Conceptual Modeling. ER 2017.