LDA¶
Online Latent Dirichlet Allocation with Infinite Vocabulary.
Latent Dirichlet allocation (LDA) is a probabilistic approach for exploring topics in document collections. The key advantage of this variant is that it assumes an infinite vocabulary, meaning that the set of tokens does not have to be known in advance, as opposed to the implementation from sklearn. The results produced by this implementation are identical to those from the original implementation proposed by the method's authors.
This class takes token counts as input, so documents must be tokenized beforehand. You can do so with a feature_extraction.BagOfWords instance, as shown in the example below.
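To make the expected input concrete, here is a minimal sketch of what the token counts for a single document look like. It uses collections.Counter with lowercasing and whitespace splitting as a stand-in for feature_extraction.BagOfWords, whose actual tokenization may differ; to_token_counts is a hypothetical helper and not part of the library.

```python
from collections import Counter


def to_token_counts(text: str) -> dict[str, int]:
    """Hypothetical helper: lowercase, split on whitespace, count tokens."""
    return dict(Counter(text.lower().split()))


# One document becomes a mapping from token to number of occurrences.
counts = to_token_counts("weather cold cold")
print(counts)  # {'weather': 1, 'cold': 2}
```

A dict of this shape, with tokens as keys and counts as values, is what the LDA step receives from the tokenizer in a pipeline.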
Parameters¶
-
n_components
Default →
10
Number of topics of the latent Dirichlet allocation.
-
number_of_documents
Default →
1000000.0
Estimated number of documents.
-
alpha_theta
Default →
0.5
Hyper-parameter of the Dirichlet distribution of topics.
-
alpha_beta
Default →
100.0
Hyper-parameter of the Dirichlet process of distribution over words.
-
tau
Default →
64.0
Learning inertia to prevent premature convergence.
-
kappa
Default →
0.75
The learning rate kappa controls how quickly new parameter estimates replace the old ones. kappa ∈ (0.5, 1] is required for convergence.
-
vocab_prune_interval
Default →
10
Interval at which to refresh the word-topic distribution.
-
number_of_samples
Default →
10
Number of iterations used to compute the document-topic distribution.
-
ranking_smooth_factor
Default →
1e-12
-
burn_in_sweeps
Default →
5
Number of iterations required while analyzing a document before the document-topic distribution is updated.
-
maximum_size_vocabulary
Default →
4000
Maximum size of the stored vocabulary.
-
seed
Type → int | None
Default →
None
Random number seed used for reproducibility.
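The roles of tau and kappa can be illustrated with the step-size schedule commonly used in online variational inference for LDA, rho_t = (tau + t)^(-kappa), where t is the number of documents seen so far. The sketch below assumes that schedule; whether this class computes it in exactly this form is an assumption, and step_size is a hypothetical helper, not part of the library.

```python
def step_size(t: int, tau: float = 64.0, kappa: float = 0.75) -> float:
    """Assumed schedule: rho_t = (tau + t) ** -kappa after t documents."""
    if not 0.5 < kappa <= 1.0:
        raise ValueError("kappa must lie in (0.5, 1] for convergence")
    return (tau + t) ** -kappa


# The weight given to new estimates shrinks as more documents are seen.
for t in (1, 100, 10_000):
    print(t, round(step_size(t), 5))
```

Under this schedule, a larger tau dampens the earliest updates, while a kappa closer to 1 makes the step size decay faster.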
Attributes¶
-
counter (int)
The current number of observed documents.
-
truncation_size_prime (int)
Number of distinct words stored in the vocabulary. Updated before processing a document.
-
truncation_size (int)
Number of distinct words stored in the vocabulary. Updated after processing a document.
-
word_to_index (dict)
Words as keys and indexes as values.
-
index_to_word (dict)
Indexes as keys and words as values.
-
nu_1 (dict)
Weights of the words. Component of the variational inference.
-
nu_2 (dict)
Weights of the words. Component of the variational inference.
Examples¶
from river import compose
from river import feature_extraction
from river import preprocessing
X = [
    'weather cold',
    'weather hot dry',
    'weather cold rainy',
    'weather hot',
    'weather cold humid',
]
lda = compose.Pipeline(
    feature_extraction.BagOfWords(),
    preprocessing.LDA(
        n_components=2,
        number_of_documents=60,
        seed=42
    )
)
for x in X:
    lda.learn_one(x)
    topics = lda.transform_one(x)
    print(topics)
{0: 0.5, 1: 2.5}
{0: 2.499..., 1: 1.5}
{0: 0.5, 1: 3.5}
{0: 0.5, 1: 2.5}
{0: 1.5, 1: 2.5}
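The values printed above are per-topic attributions that do not sum to 1 (for example, 0.5 + 2.5 = 3). If topic proportions are more convenient, they can be normalized afterwards. The helper below is a minimal sketch; normalize_topics is a hypothetical name and not part of the library.

```python
def normalize_topics(topics: dict[int, float]) -> dict[int, float]:
    """Hypothetical helper: rescale topic attributions so they sum to 1."""
    total = sum(topics.values())
    if total == 0:
        # Nothing to rescale; return an unchanged copy.
        return dict(topics)
    return {topic: weight / total for topic, weight in topics.items()}


# First output line from the example above.
proportions = normalize_topics({0: 0.5, 1: 2.5})
print(proportions)
```

Here topic 1 receives five sixths of the total weight and topic 0 the remaining sixth.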
Methods¶
learn_one
Update with a set of features x.
Many transformers don't have to do anything during the learn_one step because they are stateless. For this reason, the default behavior of this function is to do nothing. Transformers that do need to do something during learn_one can override this method.
Parameters
- x — 'dict'
learn_transform_one
Equivalent to calling lda.learn_one(x) followed by lda.transform_one(x), but faster.
Parameters
- x — 'dict'
Returns
dict: Component attributions for the input document.
transform_one
Transform a set of features x.
Parameters
- x — 'dict'
Returns
dict: The transformed values.