Calculates and binds the term frequency, inverse document frequency, and TF-IDF of the dataset. This function experimentally supports 4 types of term frequencies and 5 types of inverse document frequencies.
Arguments
- tbl
A tidy text dataset.
- term
<
data-masked
> Column containing terms.- document
<
data-masked
> Column containing document IDs.- n
<
data-masked
> Column containing document-term counts.- tf
Method for computing term frequency.
- idf
Method for computing inverse document frequency.
- norm
Logical; If passed as
TRUE
, TF-IDF values are normalized being divided with L2 norms.- rmecab_compat
Logical; If passed as
TRUE
, computes values while taking care of compatibility with 'RMeCab'. Note that 'RMeCab' always computes IDF values using term frequency rather than raw term counts, and thus TF-IDF values may be doubly affected by term frequency.
Details
Types of term frequency can be switched with tf
argument:
tf
is term frequency (not raw count of terms).tf2
is logarithmic term frequency of which base isexp(1)
.tf3
is binary-weighted term frequency.itf
is inverse term frequency. Use withidf="df"
.
Types of inverse document frequencies can be switched with idf
argument:
idf
is inverse document frequency of which base is 2, with smoothed. 'smoothed' here means just adding 1 to raw values after logarithmizing.idf2
is global frequency IDF.idf3
is probabilistic IDF of which base is 2.idf4
is global entropy, not IDF in actual.df
is document frequency. Use withtf="itf"
.