TFIDF

Computes TF-IDF values from sentences.

The TF-IDF formula is the same as the one used by scikit-learn. The only difference is that the document frequencies are updated online, whereas in a batch setting they can be computed with an initial pass through the data.
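To make the online computation concrete, here is a minimal pure-Python sketch. It is not River's implementation. It assumes scikit-learn's default smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and it uses relative term frequency, which is also an assumption (the scaling cancels once the vector is L2-normalized). The class name `OnlineTfidfSketch` and its helper are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Same pattern as the documented tokenizer_pattern default.
TOKEN_PATTERN = re.compile(r"(?u)\b\w[\w\-]+\b")


class OnlineTfidfSketch:
    """Illustrative only; a hypothetical stand-in, not River's TFIDF class."""

    def __init__(self, normalize=True):
        self.normalize = normalize
        self.dfs = defaultdict(int)  # term -> number of documents containing it
        self.n = 0  # number of documents seen so far

    def _tokenize(self, text):
        return TOKEN_PATTERN.findall(text.lower())

    def learn_one(self, text):
        # A document counts once per distinct term it contains.
        for term in set(self._tokenize(text)):
            self.dfs[term] += 1
        self.n += 1

    def transform_one(self, text):
        counts = Counter(self._tokenize(text))
        total = sum(counts.values())
        tfidf = {}
        for term, count in counts.items():
            tf = count / total  # assumption: relative term frequency
            # Smoothed IDF, as in scikit-learn's default (smooth_idf=True).
            idf = math.log((1 + self.n) / (1 + self.dfs.get(term, 0))) + 1
            tfidf[term] = tf * idf
        if self.normalize and tfidf:
            norm = math.sqrt(sum(v * v for v in tfidf.values()))
            return {term: v / norm for term, v in tfidf.items()}
        return tfidf


model = OnlineTfidfSketch()
model.learn_one("This is the first document.")
first = model.transform_one("This is the first document.")
```

After a single document every term has the same IDF, so each of the five weights in `first` equals 1/sqrt(5), roughly 0.447. Because the document frequencies keep changing as more text is seen, transforming the same sentence later generally gives different weights.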

Note that the parameters are identical to those of feature_extraction.BagOfWords.

Parameters

  • normalize

    Default: True

    Whether or not to normalize the TF-IDF values by their L2 norm.

  • on

    Type: str | None

    Default: None

    The name of the feature that contains the text to vectorize. If None, then the input is treated as a document instead of a set of features.

  • strip_accents

    Default: True

    Whether or not to strip accent characters.

  • lowercase

    Default: True

    Whether or not to convert all characters to lowercase.

  • preprocessor

    Type: typing.Callable | None

    Default: None

    An optional preprocessing function that replaces the strip_accents and lowercase steps, while preserving the tokenization and n-gram generation steps.

  • stop_words

    Type: set[str] | None

    Default: None

    An optional set of tokens to remove.

  • tokenizer_pattern

    Default: (?u)\b\w[\w\-]+\b

    The tokenization pattern used when no tokenizer function is passed. A single capture group may optionally be specified.

  • tokenizer

    Type: typing.Callable | None

    Default: None

    A function used to convert preprocessed text into a dict of tokens. By default, a regular expression that works well in most cases is used.

  • ngram_range

    Default: (1, 1)

    The lower and upper boundaries of the range of n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only works if tokenizer is not set to False.
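The tokenizer_pattern and ngram_range parameters can be illustrated with a short standalone sketch. The functions `tokenize` and `ngrams` below are hypothetical helpers written for this illustration, not part of River. Representing n-grams with n > 1 as tuples is an assumption made here for readability.

```python
import re


def tokenize(text, pattern=r"(?u)\b\w[\w\-]+\b"):
    """Lowercase the text and extract tokens with the documented default pattern."""
    return re.findall(pattern, text.lower())


def ngrams(tokens, ngram_range=(1, 1)):
    """Return every n-gram with min_n <= n <= max_n, in order of increasing n."""
    min_n, max_n = ngram_range
    out = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Unigrams stay plain strings; longer n-grams become tuples (assumption).
            out.append(gram[0] if n == 1 else tuple(gram))
    return out


tokens = tokenize("This is the first document.")
bigrams = ngrams(tokens, (2, 2))
unigrams_and_bigrams = ngrams(tokens, (1, 2))
```

Here `tokens` contains the five lowercase words and `bigrams` contains the four adjacent pairs, starting with `('this', 'is')`. Note that the default pattern requires at least two characters per token, so a one-letter word such as "a" is dropped, while a hyphenated word such as "well-known" is kept as a single token.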

Attributes

  • dfs (collections.defaultdict)

    Document counts.

  • n (int)

    Number of scanned documents.

Examples

from river import feature_extraction

tfidf = feature_extraction.TFIDF()

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

for sentence in corpus:
    tfidf.learn_one(sentence)
    print(tfidf.transform_one(sentence))

{'this': 0.447, 'is': 0.447, 'the': 0.447, 'first': 0.447, 'document': 0.447}
{'this': 0.333, 'document': 0.667, 'is': 0.333, 'the': 0.333, 'second': 0.469}
{'and': 0.497, 'this': 0.293, 'is': 0.293, 'the': 0.293, 'third': 0.497, 'one': 0.497}
{'is': 0.384, 'this': 0.384, 'the': 0.384, 'first': 0.580, 'document': 0.469}

In the example above, each document is passed as a plain string to learn_one and transform_one. If the text is stored in a dictionary, use the on parameter to indicate which field to read:

tfidf = feature_extraction.TFIDF(on='sentence')

for sentence in corpus:
    x = {'sentence': sentence}
    tfidf.learn_one(x)
    print(tfidf.transform_one(x))

{'this': 0.447, 'is': 0.447, 'the': 0.447, 'first': 0.447, 'document': 0.447}
{'this': 0.333, 'document': 0.667, 'is': 0.333, 'the': 0.333, 'second': 0.469}
{'and': 0.497, 'this': 0.293, 'is': 0.293, 'the': 0.293, 'third': 0.497, 'one': 0.497}
{'is': 0.384, 'this': 0.384, 'the': 0.384, 'first': 0.580, 'document': 0.469}

Methods

learn_many

Not available, will raise an exception.

Parameters

  • X

learn_one

Update with a set of features x.

Many transformers are stateless and have nothing to do during the learn_one step. For this reason, the default behavior of this method is to do nothing. Transformers that do need to update their state during learn_one can override it.

Parameters

  • x (dict)

process_text

transform_many

Not available, will raise an exception.

Parameters

  • X (pd.Series)

transform_one

Transform a set of features x.

Parameters

  • x (dict)

Returns

dict: The transformed values.