TFIDF¶
Computes TF-IDF values from sentences.
The TF-IDF formula is the same as the one used by scikit-learn. The only difference is that the document frequencies are updated online, whereas in a batch setting they can be computed with an initial pass over the data.
Note that the parameters are identical to those of feature_extraction.BagOfWords.
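As a rough illustration of the online computation described above, the following sketch keeps a running document count and per-term document frequencies, and applies the smoothed IDF that scikit-learn uses by default, ln((1 + n) / (1 + df)) + 1. The class name OnlineTFIDF, its regex tokenizer, and the use of a relative term frequency are assumptions made for this example; it is not the library's actual implementation.

```python
import collections
import math
import re


class OnlineTFIDF:
    """Illustrative online TF-IDF; a sketch, not river's implementation."""

    def __init__(self, normalize=True):
        self.normalize = normalize
        self.dfs = collections.defaultdict(int)  # documents seen per term
        self.n = 0  # number of documents seen so far

    @staticmethod
    def tokenize(text):
        # Assumed tokenizer: lowercase, then keep words of two or more characters.
        return re.findall(r"\b\w\w+\b", text.lower())

    def learn_one(self, text):
        # Each distinct term counts once per document.
        for term in set(self.tokenize(text)):
            self.dfs[term] += 1
        self.n += 1
        return self

    def transform_one(self, text):
        counts = collections.Counter(self.tokenize(text))
        n_terms = sum(counts.values())
        tfidf = {}
        for term, count in counts.items():
            tf = count / n_terms
            # Smoothed IDF, using the document frequencies seen so far.
            idf = math.log((1 + self.n) / (1 + self.dfs.get(term, 0))) + 1
            tfidf[term] = tf * idf
        if self.normalize:
            # L2 normalization; an empty document yields an empty dict.
            norm = math.sqrt(sum(v * v for v in tfidf.values()))
            tfidf = {term: v / norm for term, v in tfidf.items()}
        return tfidf


model = OnlineTFIDF()
for sentence in [
    'This is the first document.',
    'This document is the second document.',
]:
    model.learn_one(sentence)
    weights = model.transform_one(sentence)
```

Because the frequencies are updated one document at a time, the weights for a given sentence depend on how many documents have been seen when transform_one is called. A batch implementation would instead fix the document frequencies after a full pass over the corpus.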
Parameters¶
- normalize – defaults to True
  Whether or not to normalize the TF-IDF values by their L2 norm.
- on ('str') – defaults to None
  The name of the feature that contains the text to vectorize. If None, then the input is treated as a document instead of a set of features.
- strip_accents – defaults to True
  Whether or not to strip accent characters.
- lowercase – defaults to True
  Whether or not to convert all characters to lowercase.
- preprocessor ('typing.Callable') – defaults to None
  An optional preprocessing function which overrides the strip_accents and lowercase steps, while preserving the tokenizing and n-grams generation steps.
- tokenizer ('typing.Callable') – defaults to None
  A function used to convert preprocessed text into a dict of tokens. By default, a regex formula that works well in most cases is used.
- ngram_range – defaults to (1, 1)
  The lower and upper boundary of the range of n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only works if tokenizer is not set to False.
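To make the ngram_range semantics concrete, here is a small sketch that expands a token list into n-grams for every n between min_n and max_n. The helper name extract_ngrams and the choice to represent multi-token n-grams as tuples are assumptions for illustration, not necessarily how the library stores them.

```python
def extract_ngrams(tokens, ngram_range=(1, 1)):
    """Return all n-grams of `tokens` with min_n <= n <= max_n (hypothetical helper)."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the token list.
        for start in range(len(tokens) - n + 1):
            window = tokens[start:start + n]
            # Unigrams stay plain strings; longer n-grams become tuples.
            grams.append(window[0] if n == 1 else tuple(window))
    return grams


tokens = ['this', 'is', 'fine']
unigrams = extract_ngrams(tokens, (1, 1))
uni_and_bigrams = extract_ngrams(tokens, (1, 2))
bigrams = extract_ngrams(tokens, (2, 2))
```

Here unigrams holds the three original tokens, bigrams holds the two adjacent pairs, and uni_and_bigrams holds both sets.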
Attributes¶
- dfs (collections.defaultdict)
  Document counts, i.e. the number of documents each term has appeared in.
- n (int)
  Number of scanned documents.
Examples¶
>>> from river import feature_extraction
>>> tfidf = feature_extraction.TFIDF()
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> for sentence in corpus:
...     tfidf = tfidf.learn_one(sentence)
...     print(tfidf.transform_one(sentence))
{'this': 0.447, 'is': 0.447, 'the': 0.447, 'first': 0.447, 'document': 0.447}
{'this': 0.333, 'document': 0.667, 'is': 0.333, 'the': 0.333, 'second': 0.469}
{'and': 0.497, 'this': 0.293, 'is': 0.293, 'the': 0.293, 'third': 0.497, 'one': 0.497}
{'is': 0.384, 'this': 0.384, 'the': 0.384, 'first': 0.580, 'document': 0.469}
In the above example, a string is passed to transform_one. You can also indicate which field to access if the string is stored in a dictionary:
>>> tfidf = feature_extraction.TFIDF(on='sentence')
>>> for sentence in corpus:
...     x = {'sentence': sentence}
...     tfidf = tfidf.learn_one(x)
...     print(tfidf.transform_one(x))
{'this': 0.447, 'is': 0.447, 'the': 0.447, 'first': 0.447, 'document': 0.447}
{'this': 0.333, 'document': 0.667, 'is': 0.333, 'the': 0.333, 'second': 0.469}
{'and': 0.497, 'this': 0.293, 'is': 0.293, 'the': 0.293, 'third': 0.497, 'one': 0.497}
{'is': 0.384, 'this': 0.384, 'the': 0.384, 'first': 0.580, 'document': 0.469}
Methods¶
learn_many
learn_one
Update with a set of features x.
A lot of transformers don't actually have to do anything during the learn_one step because they are stateless. For this reason the default behavior of this function is to do nothing. Transformers that do need to update their state during learn_one can override this method.
Parameters
- x (dict)
Returns
Transformer: self
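The default-versus-override behavior of learn_one described above can be sketched as follows. The base class and both subclasses are hypothetical stand-ins written for this example, not the library's own classes; they only show the pattern of a no-op default that stateful transformers override.

```python
class Transformer:
    """Hypothetical base class, for illustration only."""

    def learn_one(self, x):
        # Default: nothing to learn. Return self so calls can be chained.
        return self

    def transform_one(self, x):
        raise NotImplementedError


class Lowercase(Transformer):
    """Stateless: inherits the no-op learn_one."""

    def transform_one(self, x):
        return {key: value.lower() for key, value in x.items()}


class MaxAbsScale(Transformer):
    """Stateful: overrides learn_one to track the largest magnitude per feature."""

    def __init__(self):
        self.max_abs = {}

    def learn_one(self, x):
        for key, value in x.items():
            self.max_abs[key] = max(self.max_abs.get(key, 0.0), abs(value))
        return self

    def transform_one(self, x):
        # Features never seen, or only seen as zero, are mapped to 0.0.
        return {
            key: value / self.max_abs[key] if self.max_abs.get(key) else 0.0
            for key, value in x.items()
        }


lower = Lowercase()
lowered = lower.learn_one({'text': 'Hello'}).transform_one({'text': 'Hello'})

scaler = MaxAbsScale()
scaler.learn_one({'a': 2.0}).learn_one({'a': -4.0})
scaled = scaler.transform_one({'a': 2.0})
```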
process_text
transform_many
Transform a pandas Series of strings into a sparse pandas DataFrame of term frequencies.
Parameters
- X ('pd.Series')
transform_one
Transform a set of features x.
Parameters
- x (dict)
Returns
dict: The transformed values.