
LDA

Online Latent Dirichlet Allocation with Infinite Vocabulary.

Latent Dirichlet allocation (LDA) is a probabilistic approach for exploring topics in document collections. The key advantage of this variant is that it assumes an infinite vocabulary, meaning that the set of tokens does not have to be known in advance, as opposed to the implementation from sklearn. The results produced by this implementation are identical to those from the original implementation proposed by the method's authors.

This class takes token counts as input, so documents must be tokenized beforehand. You can do so with a feature_extraction.BagOfWords instance, as shown in the example below.
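To make "token counts" concrete: for a single document, they are a mapping from each token to the number of times it occurs. The sketch below builds such a mapping with the standard library only. The whitespace tokenizer and the helper name `count_tokens` are assumptions made for this illustration and are simpler than what feature_extraction.BagOfWords does.

```python
from collections import Counter


def count_tokens(document: str) -> dict:
    """Lowercase, split on whitespace, and count each token.

    Simplified stand-in for a bag-of-words step (illustrative only).
    """
    return dict(Counter(document.lower().split()))


counts = count_tokens("weather cold weather")
print(counts)
```

A dictionary of this shape is what the LDA step consumes for each document.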

Parameters

  • n_components – defaults to 10

    Number of topics of the latent Dirichlet allocation.

  • number_of_documents – defaults to 1000000.0

    Estimated number of documents.

  • alpha_theta – defaults to 0.5

    Hyper-parameter of the Dirichlet distribution of topics.

  • alpha_beta – defaults to 100.0

    Hyper-parameter of the Dirichlet process for the distribution over words.

  • tau – defaults to 64.0

    Learning inertia to prevent premature convergence.

  • kappa – defaults to 0.75

    The learning rate kappa controls how quickly new parameter estimates replace the old ones. kappa ∈ (0.5, 1] is required for convergence.

  • vocab_prune_interval – defaults to 10

    Interval at which to refresh the word-topic distribution.

  • number_of_samples – defaults to 10

    Number of iterations used to compute the document-topic distribution.

  • ranking_smooth_factor – defaults to 1e-12

  • burn_in_sweeps – defaults to 5

    Number of iterations required while analyzing a document before updating the document-topic distribution.

  • maximum_size_vocabulary – defaults to 4000

    Maximum size of the stored vocabulary.

  • seed (int) – defaults to None

    Random number seed used for reproducibility.
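How tau and kappa interact can be sketched with the step-size schedule commonly used in online variational inference for LDA, rho_t = (tau + t)^(-kappa), where t counts the updates performed so far. Whether this implementation uses exactly this schedule is an assumption, and the function name `step_size` is hypothetical.

```python
def step_size(t: int, tau: float = 64.0, kappa: float = 0.75) -> float:
    """Weight given to the newest estimate after t updates.

    Assumed schedule: rho_t = (tau + t) ** (-kappa).
    """
    return (tau + t) ** (-kappa)


# The weight of new estimates shrinks as more documents are seen.
for t in (0, 10, 100, 1000):
    print(t, round(step_size(t), 4))
```

Under this schedule, a larger tau down-weights the earliest updates, and a kappa closer to 1 makes the weights decay faster.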

Attributes

  • counter (int)

    The current number of observed documents.

  • truncation_size_prime (int)

    Number of distinct words stored in the vocabulary. Updated before processing a document.

  • truncation_size (int)

    Number of distinct words stored in the vocabulary. Updated after processing a document.

  • word_to_index (dict)

    Words as keys and indexes as values.

  • index_to_word (dict)

    Indexes as keys and words as values.

  • nu_1 (dict)

    Weights of the words. Component of the variational inference.

  • nu_2 (dict)

    Weights of the words. Component of the variational inference.
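The relationship between word_to_index and index_to_word can be illustrated with two dictionaries kept in sync as new words arrive. This is a minimal sketch, not the library's code: the helper `add_word` is hypothetical, and the sketch ignores the pruning implied by maximum_size_vocabulary.

```python
word_to_index = {}
index_to_word = {}


def add_word(word: str) -> int:
    """Return the index of word, assigning the next free index if it is new."""
    if word not in word_to_index:
        index = len(word_to_index)
        word_to_index[word] = index
        index_to_word[index] = word
    return word_to_index[word]


# Repeated words keep their original index; unseen words get a new one.
for token in ["weather", "cold", "weather", "hot"]:
    add_word(token)

print(word_to_index)
print(index_to_word)
```

Because the vocabulary is not fixed in advance, both mappings grow as unseen tokens appear, and each is the inverse of the other.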

Examples

>>> from river import compose
>>> from river import feature_extraction
>>> from river import preprocessing

>>> X = [
...    'weather cold',
...    'weather hot dry',
...    'weather cold rainy',
...    'weather hot',
...    'weather cold humid',
... ]

>>> lda = compose.Pipeline(
...     feature_extraction.BagOfWords(),
...     preprocessing.LDA(
...         n_components=2,
...         number_of_documents=60,
...         seed=42
...     )
... )

>>> for x in X:
...     lda = lda.learn_one(x)
...     topics = lda.transform_one(x)
...     print(topics)
{0: 0.5, 1: 2.5}
{0: 1.5, 1: 2.5}
{0: 3.5, 1: 0.5}
{0: 1.5, 1: 1.5}
{0: 2.5, 1: 1.5}
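The values printed above are non-negative weights per topic and do not sum to 1. If proportions are more convenient downstream, they can be rescaled. The helper `normalize` below is hypothetical and not part of the library.

```python
def normalize(topics: dict) -> dict:
    """Rescale topic weights so that they sum to 1 (unchanged if all are zero)."""
    total = sum(topics.values())
    if total == 0:
        return dict(topics)
    return {topic: weight / total for topic, weight in topics.items()}


proportions = normalize({0: 0.5, 1: 2.5})
print(proportions)
```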

Methods

learn_one

Update with a set of features x.

Many transformers don't actually have to do anything during the learn_one step because they are stateless. For this reason, the default behavior of this function is to do nothing. Transformers that do need to do something during learn_one can override this method.

Parameters

  • x (dict)

Returns

Transformer: self

learn_transform_one

Equivalent to lda.learn_one(x).transform_one(x), but faster.

Parameters

  • x (dict)

Returns

dict: Component attributions for the input document.

transform_one

Transform a set of features x.

Parameters

  • x (dict)

Returns

dict: The transformed values.
