Calculates and binds the term frequency, inverse document frequency, and TF-IDF of the dataset. This function experimentally supports 4 types of term frequencies and 5 types of inverse document frequencies.
Arguments
- tbl
A tidy text dataset.
- term
<data-masked> Column containing terms.
- document
<data-masked> Column containing document IDs.
- n
<data-masked> Column containing document-term counts.
- tf
Method for computing term frequency.
- idf
Method for computing inverse document frequency.
- norm
Logical; if TRUE, TF-IDF values are normalized by dividing them by their L2 norm.
- rmecab_compat
Logical; if TRUE, values are computed so as to stay compatible with 'RMeCab'. Note that 'RMeCab' always computes IDF values using term frequency rather than raw term counts, and thus TF-IDF values may be doubly affected by term frequency.
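The L2 normalization described for norm can be illustrated with a minimal base R sketch. The data frame toy and its values are made up for illustration, and taking the norm per document is an assumption about how the normalization is grouped; this is not the package's own implementation.

```r
# Toy TF-IDF values for two documents (made-up numbers, not package output).
toy <- data.frame(
  doc_id = c("a", "a", "b", "b", "b"),
  token  = c("x", "y", "x", "y", "z"),
  tf_idf = c(3, 4, 1, 2, 2)
)

# L2 norm of each document's TF-IDF vector, repeated on every row of that document.
# Assumption: the norm is taken per document.
l2 <- ave(toy$tf_idf, toy$doc_id, FUN = function(v) sqrt(sum(v^2)))

# Dividing by the norm gives each document a TF-IDF vector of length 1.
toy$tf_idf_norm <- toy$tf_idf / l2
toy
```

After this step the squared normalized values sum to 1 within each document, which makes documents of different lengths comparable.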
Details
The type of term frequency can be switched with the tf argument:
- tf is term frequency (not the raw count of terms).
- tf2 is logarithmic term frequency, using the natural logarithm (base exp(1)).
- tf3 is binary-weighted term frequency.
- itf is inverse term frequency. Use with idf="df".
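To make the four term-frequency variants more concrete, here is a rough base R sketch on toy counts. The formulas are common textbook forms and are assumptions, not taken from the package source; in particular the "+ 1" inside the logarithm and the exact form of the inverse term frequency may differ from what the function computes. The data and column names are hypothetical.

```r
# Toy document-term counts (hypothetical data).
counts <- data.frame(
  doc_id = c("a", "a", "b", "b", "b"),
  token  = c("x", "y", "x", "y", "z"),
  n      = c(3, 1, 1, 2, 2)
)

# Total number of terms in each document, aligned to the rows.
doc_total <- ave(counts$n, counts$doc_id, FUN = sum)

# Relative term frequency: the share of the document taken up by the term.
counts$tf_relative <- counts$n / doc_total
# Natural-log variant; the "+ 1" is an assumption.
counts$tf_log <- log(counts$n + 1)
# Binary weighting: 1 if the term occurs in the document at all.
counts$tf_binary <- as.numeric(counts$n > 0)
# One common form of inverse term frequency (assumed).
counts$tf_inverse <- log(doc_total / counts$n)

counts
```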
The type of inverse document frequency can be switched with the idf argument:
- idf is inverse document frequency with base 2, smoothed. 'Smoothed' here just means adding 1 to the value after taking the logarithm.
- idf2 is global frequency IDF.
- idf3 is probabilistic IDF with base 2.
- idf4 is global entropy, which is not actually an IDF.
- df is document frequency. Use with tf="itf".
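The global weights listed above can be sketched in base R as follows. Only the smoothed IDF (base-2 logarithm plus 1) follows directly from the description here; the global frequency, probabilistic, and entropy formulas are the usual textbook definitions and are assumed rather than taken from the package source. The toy data and all variable names are hypothetical.

```r
# Toy document-term counts for three documents (hypothetical data).
counts <- data.frame(
  doc_id = c("a", "a", "b", "c"),
  token  = c("x", "y", "x", "z"),
  n      = c(2, 1, 2, 4)
)

n_docs <- length(unique(counts$doc_id))  # number of documents

# Document frequency: in how many documents each term occurs.
doc_freq <- tapply(counts$doc_id, counts$token, function(d) length(unique(d)))
# Global frequency: total count of each term over all documents.
global_freq <- tapply(counts$n, counts$token, sum)

# Share of each term's occurrences that falls in each document (for the entropy weight).
p <- counts$n / as.vector(global_freq[as.character(counts$token)])

weights <- data.frame(
  token      = names(doc_freq),
  doc_freq   = as.vector(doc_freq),
  # Smoothed IDF: base-2 logarithm, then add 1.
  idf_smooth = as.vector(log2(n_docs / doc_freq) + 1),
  # Global frequency IDF (textbook form, assumed).
  idf_gf     = as.vector(global_freq / doc_freq),
  # Probabilistic IDF with base 2 (textbook form, assumed).
  idf_prob   = as.vector(log2((n_docs - doc_freq) / doc_freq)),
  # Global entropy weight (textbook form, assumed).
  entropy    = as.vector(1 + tapply(p * log(p), counts$token, sum) / log(n_docs))
)
weights
```

Note that with the textbook probabilistic form, a term occurring in every document gets a weight of -Inf, and a term occurring in more than half of the documents gets a negative weight.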
Examples
# \donttest{
df <- dplyr::count(hiroba, doc_id, token)
bind_tf_idf2(df) |>
  head()
#> doc_id token n tf idf tf_idf
#> 1 1 の 1 0.3333333 2.011277 0.6704258
#> 2 1 ポラーノ 1 0.3333333 5.602724 1.8675746
#> 3 1 広場 1 0.3333333 4.979287 1.6597624
#> 4 2 宮沢 1 0.5000000 9.812177 4.9060887
#> 5 2 賢治 1 0.5000000 9.812177 4.9060887
#> 6 3 レオーノ・キュースト 1 0.1428571 8.490249 1.2128927
# }