TextClust

textClust, a clustering algorithm for text data.

textClust [1, 2] is a stream clustering algorithm for textual data that identifies and tracks topics over time in a stream of texts. The algorithm follows the popular two-phase clustering approach, in which the stream is first summarised in real time.

The result is a set of many small preliminary clusters, called micro-clusters. Micro-clusters maintain enough information to be updated over time and to efficiently calculate the cosine similarity between them, based on the TF-IDF vectors of their texts. Upon request, the micro-clusters can be reclustered to generate the final result using any distance-based clustering algorithm, such as hierarchical clustering. To keep the micro-clusters up to date, the algorithm applies a fading strategy in which micro-clusters that are not updated regularly lose relevance and are eventually removed.
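The similarity computation described above can be illustrated with a short, library-independent sketch. It is not River's implementation: the helper names `idf_weights` and `cosine_distance_sketch` are hypothetical, each micro-cluster is assumed to be representable as a plain term-frequency mapping, and the smoothed IDF formula is one common variant chosen for illustration.

```python
import math
from collections import Counter


def idf_weights(clusters):
    """Smoothed inverse document frequency, treating each micro-cluster
    as one document (an assumption made for this sketch)."""
    n = len(clusters)
    df = Counter()
    for tf in clusters:
        df.update(tf.keys())  # count in how many clusters each term occurs
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def cosine_distance_sketch(tf_a, tf_b, idf):
    """1 - cosine similarity of the TF-IDF vectors of two term-frequency maps."""
    vec_a = {t: f * idf.get(t, 0.0) for t, f in tf_a.items()}
    vec_b = {t: f * idf.get(t, 0.0) for t, f in tf_b.items()}
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # an empty vector is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


# Three toy micro-clusters given as term frequencies.
mc_a = Counter({"first": 2, "document": 3})
mc_b = Counter({"second": 1, "document": 2})
mc_c = Counter({"super": 2, "unrelated": 2})

idf = idf_weights([mc_a, mc_b, mc_c])
print(cosine_distance_sketch(mc_a, mc_b, idf))  # shares "document", so below 1
print(cosine_distance_sketch(mc_a, mc_c, idf))  # no shared terms
```

Because only term frequencies and cluster-level document frequencies are needed, a summary of this kind can be updated incrementally as new texts arrive.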

Parameters

  • radius

    Default: 0.3

    Distance threshold to merge two micro-clusters. Must be within the range (0, 1]

  • fading_factor

    Default: 0.0005

    Fading factor of micro-clusters

  • tgap

    Default: 100

    Time between outlier removals

  • term_fading

    Default: True

    Determines whether individual terms should also be faded

  • real_time_fading

    Default: True

    Whether fading is based on natural time (True) or on the number of observations (False)

  • micro_distance

    Default: tfidf_cosine_distance

    Distance metric used for clustering micro-clusters

  • macro_distance

    Default: tfidf_cosine_distance

    Distance metric used for clustering macro-clusters

  • num_macro

    Default: 3

    Number of macro clusters that should be identified during the reclustering phase

  • min_weight

    Default: 0

    Minimum weight of micro clusters to be used for reclustering

  • auto_r

    Default: False

    Specifies whether the radius should be updated automatically

  • auto_merge

    Default: True

    Determines whether close observations should be merged

  • sigma

    Default: 1

    Parameter that influences the automated threshold adaptation technique
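To make the interplay of `fading_factor`, `tgap` and `real_time_fading` more concrete, here is a minimal sketch of an exponential fading scheme. It rests on assumptions not stated on this page: the decay form `2 ** (-fading_factor * elapsed)` and the removal threshold `2 ** (-fading_factor * tgap)` are typical of stream clustering algorithms, and the helpers `fade` and `cleanup` are hypothetical names, not part of River's API.

```python
def fade(weight, fading_factor, elapsed):
    """Decay a weight exponentially with the time elapsed since its last update.

    `elapsed` is measured in seconds or in observations, depending on which
    notion of time is used (compare `real_time_fading`).
    """
    return weight * 2.0 ** (-fading_factor * elapsed)


def cleanup(weights, last_update, now, fading_factor, tgap):
    """Return the faded weights of clusters that stay above the assumed threshold.

    In this sketch, a cluster of weight 1 that has not been updated for
    `tgap` time units decays exactly to the threshold and is dropped.
    """
    threshold = 2.0 ** (-fading_factor * tgap)
    survivors = {}
    for cluster_id, weight in weights.items():
        faded = fade(weight, fading_factor, now - last_update[cluster_id])
        if faded > threshold:  # keep only clusters strictly above the threshold
            survivors[cluster_id] = faded
    return survivors


# Two clusters of equal weight: "a" was last updated at t=0, "b" at t=190.
weights = {"a": 1.0, "b": 1.0}
last_update = {"a": 0, "b": 190}
remaining = cleanup(weights, last_update, now=200, fading_factor=0.01, tgap=100)
print(sorted(remaining))
```

Under these assumptions, a larger `fading_factor` forgets old topics faster, while a larger `tgap` lowers the removal threshold and runs the cleanup less often.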

Attributes

  • micro_clusters

    Micro-clusters generated by the algorithm. Micro-clusters are of type textclust.microcluster

Examples

from river import compose
from river import feature_extraction
from river import metrics
from river import cluster

corpus = [
   {"text":'This is the first document.',"idd":1,"cluster": 1},
   {"text":'This document is the second document.',"idd":2,"cluster": 1},
   {"text":'And this is super unrelated.',"idd":3,"cluster": 2},
   {"text":'Is this the first document?',"idd":4,"cluster": 1},
   {"text":'This is super unrelated as well',"idd":5,"cluster": 2},
   {"text":'Test text',"idd":6,"cluster": 5}
]

stopwords = [ 'stop', 'the', 'to', 'and', 'a', 'in', 'it', 'is', 'I']

metric = metrics.AdjustedRand()

model = compose.Pipeline(
    feature_extraction.BagOfWords(lowercase=True, ngram_range=(1, 2), stop_words=stopwords),
    cluster.TextClust(real_time_fading=False, fading_factor=0.001, tgap=100, auto_r=True,
                      radius=0.9)
)

for x in corpus:
    y_pred = model.predict_one(x["text"])
    y = x["cluster"]
    metric.update(y,y_pred)
    model.learn_one(x["text"])

print(metric)
AdjustedRand: -0.17647058823529413

Methods

distances
get_assignment
get_macroclusters
learn_one

Update the model with a set of features x.

Parameters

  • x (dict)
  • t — defaults to None
  • w — defaults to None

microcluster
predict_one

Predicts the cluster number for a set of features x.

Parameters

  • x (dict)
  • w — defaults to None
  • type — defaults to micro

Returns

int: A cluster number.
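As a rough illustration of what a cluster prediction involves, the sketch below looks up the micro-cluster closest to an incoming bag of words. It is an assumption-laden stand-in rather than River's code: `cosine_distance` and `nearest_cluster` are hypothetical helpers, raw term counts replace TF-IDF weights for brevity, and how the library treats an observation that lies beyond `radius` is not specified on this page.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse term-count mappings."""
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def nearest_cluster(x, micro_clusters):
    """Return (cluster_id, distance) of the micro-cluster closest to x."""
    best_id, best_dist = None, float("inf")
    for cluster_id, term_counts in micro_clusters.items():
        dist = cosine_distance(x, term_counts)
        if dist < best_dist:
            best_id, best_dist = cluster_id, dist
    return best_id, best_dist


# Toy micro-clusters keyed by integer id.
micro_clusters = {
    0: {"first": 1, "document": 2},
    1: {"super": 1, "unrelated": 1},
}

x = {"first": 1, "document": 1}
cluster_id, dist = nearest_cluster(x, micro_clusters)
radius = 0.3
print(cluster_id, dist <= radius)  # the caller compares the distance with radius
```

Returning the distance alongside the id leaves the decision about observations outside the radius to the caller.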

showclusters
tfcontainer
updateMacroClusters

  1. Assenmacher, D. and Trautmann, H. (2022). Textual One-Pass Stream Clustering with Automated Distance Threshold Adaption. In: Asian Conference on Intelligent Information and Database Systems (Accepted).

  2. Carnein, M., Assenmacher, D., Trautmann, H. (2017). Stream Clustering of Chat Messages with Applications to Twitch Streams. In: Advances in Conceptual Modeling. ER 2017.