Calculates and binds the term frequency, inverse document frequency, and TF-IDF of the dataset. This function experimentally supports 4 types of term frequencies and 5 types of inverse document frequencies.
Arguments
- tbl
A tidy text dataset.
- term
<data-masked> Column containing terms.
- document
<data-masked> Column containing document IDs.
- n
<data-masked> Column containing document-term counts.
- tf
Method for computing term frequency.
- idf
Method for computing inverse document frequency.
- norm
Logical; if TRUE, TF-IDF values are normalized by dividing them by their L2 norm.
- rmecab_compat
Logical; if TRUE, values are computed in a way that is compatible with 'RMeCab'. Note that 'RMeCab' always computes IDF values using term frequency rather than raw term counts, and thus TF-IDF values may be doubly affected by term frequency.
Details
The type of term frequency is selected with the tf argument:
- tf is term frequency (not the raw count of terms).
- tf2 is logarithmic term frequency, with base exp(1).
- tf3 is binary-weighted term frequency.
- itf is inverse term frequency. Use with idf="df".
The type of inverse document frequency is selected with the idf argument:
- idf is smoothed inverse document frequency, with base 2. 'Smoothed' here simply means adding 1 to the value after taking the logarithm.
- idf2 is global frequency IDF.
- idf3 is probabilistic IDF, with base 2.
- idf4 is global entropy, which is not actually an IDF.
- df is document frequency. Use with tf="itf".
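As a rough illustration of the tf and idf options and the norm argument, the following base R sketch computes a relative term frequency, a base-2 IDF with 1 added after the logarithm, their product, and an L2-normalized variant. It is not the package's implementation. The toy data frame `counts` and the columns it gains are hypothetical, and taking the L2 norm per document is an assumption, since the argument description does not say over which group the norm is computed.

```r
# Toy document-term counts (hypothetical data; one row per document-term pair).
counts <- data.frame(
  doc_id = c(1, 1, 2, 2, 3),
  token  = c("a", "b", "a", "c", "a"),
  n      = c(2, 2, 1, 3, 4)
)

# Number of distinct documents.
n_docs <- length(unique(counts$doc_id))

# Term frequency: count divided by the total count within the same document.
counts$tf <- counts$n / ave(counts$n, counts$doc_id, FUN = sum)

# Document frequency: number of documents in which each term occurs.
doc_freq <- ave(counts$n, counts$token, FUN = length)

# Smoothed IDF with base 2: add 1 after taking the logarithm.
counts$idf <- log2(n_docs / doc_freq) + 1

# TF-IDF is the product of the two.
counts$tf_idf <- counts$tf * counts$idf

# L2 normalization, here assumed to be applied within each document.
l2_norm <- sqrt(ave(counts$tf_idf^2, counts$doc_id, FUN = sum))
counts$tf_idf_norm <- counts$tf_idf / l2_norm

print(counts)
```

In this sketch, a term that occurs in every document receives an IDF of exactly 1 rather than 0, so its TF-IDF equals its term frequency. That is the practical effect of adding 1 after the logarithm.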
Examples
# \donttest{
df <- dplyr::count(hiroba, doc_id, token)
bind_tf_idf2(df) |>
head()
#> doc_id token n tf idf tf_idf
#> 1 1 の 1 0.3333333 2.011277 0.6704258
#> 2 1 ポラーノ 1 0.3333333 5.602724 1.8675746
#> 3 1 広場 1 0.3333333 4.979287 1.6597624
#> 4 2 宮沢 1 0.5000000 9.812177 4.9060887
#> 5 2 賢治 1 0.5000000 9.812177 4.9060887
#> 6 3 レオーノ・キュースト 1 0.1428571 8.490249 1.2128927
# }
