Packs a data.frame of tokens into a new corpus data.frame that is compatible with the Text Interchange Formats (TIF).
Arguments
- tbl
A data.frame of tokens.
- pull
<data-masked> Column to be packed into text or ngrams body. Default value is token.
- n
Integer internally passed to the ngrams tokenizer function created by audubon::ngram_tokenizer().
- sep
Character scalar internally used as the concatenator of ngrams.
- .collapse
This argument is passed to stringi::stri_c().
Text Interchange Formats (TIF)
The Text Interchange Formats (TIF) is a set of standards that allows R text analysis packages to target defined inputs and outputs for corpora, tokens, and document-term matrices.
Valid data.frame of tokens
The data.frame of tokens here is a data.frame object compatible with the TIF.
A TIF-valid data.frame of tokens is expected to have one key column (named doc_id) identifying the text each row belongs to, plus one or more feature columns describing each token.
The feature columns must contain at least the token itself (a column named token).
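The shape described above can be illustrated with a small base R sketch. The toy tokens are hypothetical, and the manual collapse only approximates what packing does for plain text; it assumes a single space as the separator, as in the example output, and is not the package's actual implementation.

```r
# Hypothetical toy input: one row per token, keyed by doc_id.
tokens <- data.frame(
  doc_id = c("1", "1", "1", "2", "2"),
  token = c("ポラーノ", "の", "広場", "宮沢", "賢治"),
  stringsAsFactors = FALSE
)

# Join the tokens of each document into one string (assumed separator: " ").
text <- tapply(tokens$token, tokens$doc_id, paste, collapse = " ")

# One row per document: the corpus layout with doc_id and text columns.
corpus <- data.frame(
  doc_id = names(text),
  text = as.vector(text),
  stringsAsFactors = FALSE
)
```

Here corpus has one row per doc_id, with the tokens of each document joined into a single text value.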
Examples
pack(strj_tokenize(polano[1:5], format = "data.frame"))
#> # A tibble: 5 × 2
#> doc_id text
#> <fct> <chr>
#> 1 1 ポラーノ の 広場
#> 2 2 宮沢 賢治
#> 3 3 前 十七 等 官 レ オー ノ ・ キュー スト 誌
#> 4 4 宮沢 賢治 訳述
#> 5 5 その ころ わたくし は 、 モリーオ 市 の 博物 局 に 勤め て 居 り ま し…