LDA

Online Latent Dirichlet Allocation with Infinite Vocabulary.

Latent Dirichlet allocation (LDA) is a probabilistic approach for exploring topics in document collections. The key advantage of this variant is that it assumes an infinite vocabulary, meaning that the set of tokens does not have to be known in advance, as opposed to the implementation from sklearn. The results produced by this implementation are identical to those from the original implementation proposed by the method's authors.

This class takes token counts as input, so documents must be tokenized beforehand. You can do so with a feature_extraction.BagOfWords instance, as shown in the example below.
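
To make the expected input shape concrete, here is a minimal sketch of what "token counts" means: a dict mapping each token to the number of times it occurs in one document. The helper count_tokens is hypothetical and only a stand-in for feature_extraction.BagOfWords, whose actual preprocessing may differ (this sketch assumes lowercasing and whitespace splitting).

```python
from collections import Counter


def count_tokens(document: str) -> dict:
    """Hypothetical stand-in for a bag-of-words step.

    Assumption: tokens are lowercased and split on whitespace.
    """
    return dict(Counter(document.lower().split()))


counts = count_tokens("weather cold weather")
print(counts)  # {'weather': 2, 'cold': 1}
```

A dict of this shape is what the x argument of learn_one and transform_one is expected to hold once tokenization has been done.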

Parameters

  • n_components

    Default: 10

    Number of topics of the latent Dirichlet allocation.

  • number_of_documents

    Default: 1000000.0

    Estimated number of documents.

  • alpha_theta

    Default: 0.5

    Hyper-parameter of the Dirichlet distribution of topics.

  • alpha_beta

    Default: 100.0

    Hyper-parameter of the Dirichlet process of distribution over words.

  • tau

    Default: 64.0

    Learning inertia to prevent premature convergence.

  • kappa

    Default: 0.75

    The learning rate kappa controls how quickly new parameter estimates replace the old ones. kappa ∈ (0.5, 1] is required for convergence.

  • vocab_prune_interval

    Default: 10

    Interval at which to refresh the word-topic distribution.

  • number_of_samples

    Default: 10

    Number of iterations used to compute the document-topic distribution.

  • ranking_smooth_factor

    Default: 1e-12

  • burn_in_sweeps

    Default: 5

    Number of iterations required while analyzing a document before updating the document-topic distribution.

  • maximum_size_vocabulary

    Default: 4000

    Maximum size of the stored vocabulary.

  • seed

    Type: int | None

    Default: None

    Random number seed used for reproducibility.
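
The interaction between tau and kappa can be sketched as follows. This assumes the step-size schedule commonly used in online variational Bayes for LDA, rho_t = (tau + t)^(-kappa); the page does not state the formula, so treat it as an illustration, not a description of this class's internals. The function step_size is hypothetical.

```python
def step_size(t: int, tau: float = 64.0, kappa: float = 0.75) -> float:
    """Weight given to the newest estimate at update t (t >= 1).

    Assumed schedule: rho_t = (tau + t) ** (-kappa).
    """
    if not 0.5 < kappa <= 1.0:
        raise ValueError("kappa must lie in (0.5, 1]")
    return (tau + t) ** (-kappa)


# A larger tau damps the earliest updates; a larger kappa makes the
# weight of new estimates decay faster as more documents are seen.
for t in (1, 10, 100, 1000):
    print(t, step_size(t))
```

Under this assumption, the weight of each new estimate shrinks as t grows, which is how old estimates come to dominate once many documents have been observed.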

Attributes

  • counter (int)

    The current number of observed documents.

  • truncation_size_prime (int)

    Number of distinct words stored in the vocabulary. Updated before processing a document.

  • truncation_size (int)

    Number of distinct words stored in the vocabulary. Updated after processing a document.

  • word_to_index (dict)

    Words as keys and indexes as values.

  • index_to_word (dict)

    Indexes as keys and words as values.

  • nu_1 (dict)

    Weights of the words. Component of the variational inference.

  • nu_2 (dict)

    Weights of the words. Component of the variational inference.

Examples

from river import compose
from river import feature_extraction
from river import preprocessing

X = [
   'weather cold',
   'weather hot dry',
   'weather cold rainy',
   'weather hot',
   'weather cold humid',
]

lda = compose.Pipeline(
    feature_extraction.BagOfWords(),
    preprocessing.LDA(
        n_components=2,
        number_of_documents=60,
        seed=42
    )
)

for x in X:
    lda = lda.learn_one(x)
    topics = lda.transform_one(x)
    print(topics)
{0: 0.5, 1: 2.5}
{0: 2.499..., 1: 1.5}
{0: 0.5, 1: 3.5}
{0: 0.5, 1: 2.5}
{0: 1.5, 1: 2.5}
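
The values printed above do not sum to one, so they read as unnormalized per-topic weights. If proportions are wanted, one option is to normalize each dict after the fact. The helper to_proportions is hypothetical and not part of the library.

```python
def to_proportions(topics: dict) -> dict:
    """Rescale per-topic weights so that they sum to one.

    Hypothetical helper: takes a dict mapping topic index to weight,
    in the shape printed in the example above.
    """
    total = sum(topics.values())
    if total == 0:
        # Nothing to rescale; avoid dividing by zero.
        return {topic: 0.0 for topic in topics}
    return {topic: weight / total for topic, weight in topics.items()}


proportions = to_proportions({0: 0.5, 1: 2.5})
print(proportions)
```

Here topic 1 carries five sixths of the total weight and topic 0 the remaining sixth.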

Methods

learn_one

Update with a set of features x.

Many transformers are stateless and do not need to do anything during the learn_one step. For this reason, the default behavior of this method is to do nothing. Transformers that do update their state during learn_one can override it.

Parameters

  • x (dict)

Returns

Transformer: self

learn_transform_one

Equivalent to lda.learn_one(x).transform_one(x), but faster.

Parameters

  • x (dict)

Returns

dict: Component attributions for the input document.

transform_one

Transform a set of features x.

Parameters

  • x (dict)

Returns

dict: The transformed values.