LDA¶

Online Latent Dirichlet Allocation with Infinite Vocabulary.

Latent Dirichlet allocation (LDA) is a probabilistic approach for exploring topics in document collections. The key advantage of this variant is that it assumes an infinite vocabulary, meaning that the set of tokens does not have to be known in advance, as opposed to the implementation from sklearn. The results produced by this implementation are identical to those from the original implementation proposed by the method's authors.

This class takes as input token counts. Therefore, it requires you to tokenize beforehand. You can do so by using a feature_extraction.BagOfWords instance, as shown in the example below.
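To make the expected input concrete, the following minimal sketch builds token counts from raw text using only the standard library. The helper `count_tokens` is hypothetical and is only a simplified stand-in for a bag-of-words step; it assumes lowercasing and splitting on word characters, which may differ from what feature_extraction.BagOfWords actually does.

```python
import re
from collections import Counter


def count_tokens(text: str) -> dict[str, int]:
    """Hypothetical helper: lowercase the text, extract word tokens, count them."""
    tokens = re.findall(r"\w+", text.lower())
    return dict(Counter(tokens))


# A document becomes a mapping from token to number of occurrences.
counts = count_tokens("weather cold cold")
print(counts)
```

A dictionary of this shape (tokens as keys, counts as values) is the kind of input the class consumes.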

Parameters¶

n_components

Default → 10

Number of topics of the latent Dirichlet allocation.
number_of_documents

Default → 1000000.0

Estimated number of documents.
alpha_theta

Default → 0.5

Hyper-parameter of the Dirichlet distribution of topics.
alpha_beta

Default → 100.0

Hyper-parameter of the Dirichlet process governing the distribution over words.
tau

Default → 64.0

Learning inertia to prevent premature convergence.
kappa

Default → 0.75

The learning rate kappa controls how quickly new parameter estimates replace the old ones. kappa ∈ (0.5, 1] is required for convergence.
vocab_prune_interval

Default → 10

Interval at which to refresh the word-topic distribution.
number_of_samples

Default → 10

Number of iterations used to compute the document-topic distribution.
ranking_smooth_factor

Default → 1e-12
burn_in_sweeps

Default → 5

Number of iterations required while analyzing a document before the document-topic distribution is updated.
maximum_size_vocabulary

Default → 4000

Maximum size of the stored vocabulary.
seed

Type → int | None

Default → None

Random number seed used for reproducibility.
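
The interplay between tau and kappa can be illustrated with the step-size schedule commonly used in online variational Bayes, rho_t = (tau + t) ** -kappa, where t is the number of documents seen so far. This is a sketch under that assumption; the function `step_size` is hypothetical and this page does not state that the class uses exactly this form.

```python
def step_size(t: int, tau: float = 64.0, kappa: float = 0.75) -> float:
    """Assumed schedule: weight given to the newest estimate after t documents."""
    return (tau + t) ** -kappa


# A larger tau shrinks the early steps; a larger kappa makes the steps decay faster.
for t in (0, 1, 10, 100, 1000):
    print(t, step_size(t))
```

Under this schedule, a large tau damps the influence of the first few documents, while kappa sets how fast later documents lose influence.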

Attributes¶

counter (int)

The current number of observed documents.
truncation_size_prime (int)

Number of distinct words stored in the vocabulary. Updated before processing a document.
truncation_size (int)

Number of distinct words stored in the vocabulary. Updated after processing a document.
word_to_index (dict)

Words as keys and indexes as values.
index_to_word (dict)

Indexes as keys and words as values.
nu_1 (dict)

Weights of the words. Component of the variational inference.
nu_2 (dict)

Weights of the words. Component of the variational inference.
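
The two mappings word_to_index and index_to_word are inverses of each other. The sketch below shows one simple way such a pair can grow as unseen words arrive; the `register` helper is hypothetical, and the actual index values assigned by the class (for example the starting offset) may differ.

```python
word_to_index: dict[str, int] = {}
index_to_word: dict[int, str] = {}


def register(word: str) -> int:
    """Hypothetical helper: return the index of `word`, assigning a new one if unseen."""
    if word not in word_to_index:
        index = len(word_to_index)
        word_to_index[word] = index
        index_to_word[index] = word
    return word_to_index[word]


# Repeated words keep their index; new words extend both mappings.
for token in ["weather", "cold", "weather", "hot"]:
    register(token)

print(word_to_index)
print(index_to_word)
```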

Examples¶

from river import compose
from river import feature_extraction
from river import preprocessing

X = [
   'weather cold',
   'weather hot dry',
   'weather cold rainy',
   'weather hot',
   'weather cold humid',
]

lda = compose.Pipeline(
    feature_extraction.BagOfWords(),
    preprocessing.LDA(
        n_components=2,
        number_of_documents=60,
        seed=42
    )
)

for x in X:
    lda.learn_one(x)
    topics = lda.transform_one(x)
    print(topics)

{0: 0.5, 1: 2.5}
{0: 2.499..., 1: 1.5}
{0: 0.5, 1: 3.5}
{0: 0.5, 1: 2.5}
{0: 1.5, 1: 2.5}
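
The values printed above do not sum to one. If topic proportions are wanted, one option is to normalize each output dictionary, as in this minimal sketch. The `normalize` helper is hypothetical and not part of the library.

```python
def normalize(topics: dict[int, float]) -> dict[int, float]:
    """Hypothetical helper: rescale topic weights so that they sum to one."""
    total = sum(topics.values())
    return {topic: weight / total for topic, weight in topics.items()}


# Weights taken from the first line of the output above.
proportions = normalize({0: 0.5, 1: 2.5})
print(proportions)
```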

Methods¶

learn_one

Update with a set of features x.

A lot of transformers don't actually have to do anything during the learn_one step because they are stateless. For this reason, the default behavior of this function is to do nothing. However, transformers that do something during learn_one can override this method.

Parameters

x — 'dict[base.typing.FeatureName, Any]'

learn_transform_one

Equivalent to calling lda.learn_one(x) followed by lda.transform_one(x), but faster.

Parameters

x — 'dict'

Returns

dict: Component attributions for the input document.

transform_one

Transform a set of features x.

Parameters

x — 'dict[base.typing.FeatureName, Any]'

Returns

dict[base.typing.FeatureName, Any]: The transformed values.