Packs a prettified data.frame of tokens into a new corpus data.frame that is compatible with the Text Interchange Formats.
Arguments
- tbl
A prettified data.frame of tokens.
- pull
Column to be packed into the text or n-gram body. The default is `token`.
- n
Integer passed internally to the n-gram tokenizer function created by `audubon::ngram_tokenizer()`.
- sep
Character scalar used internally as the concatenator of n-grams.
- .collapse
This argument is passed to `stringi::stri_join()`.
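To illustrate how the arguments above might interact, here is a minimal base R sketch. It is not the package's implementation: `make_ngrams()` and `pack_sketch()` are hypothetical helpers, `paste()` stands in for `stringi::stri_join()`, and the defaults chosen for `n`, `sep`, and `.collapse` are assumptions made for the example.

```r
# Toy tokens table with a `doc_id` key column and a `token` feature column.
tbl <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2"),
  token  = c("a", "b", "c", "x", "y"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build n-grams from a character vector and join the
# pieces of each n-gram with `sep`.
make_ngrams <- function(x, n = 1L, sep = "-") {
  if (length(x) < n) return(character(0))
  starts <- seq_len(length(x) - n + 1L)
  vapply(
    starts,
    function(i) paste(x[i:(i + n - 1L)], collapse = sep),
    character(1)
  )
}

# Hypothetical helper: pack the `pull` column into one text per document.
# Default values here are assumptions for the sketch.
pack_sketch <- function(tbl, pull = "token", n = 1L, sep = "-", .collapse = " ") {
  groups <- split(tbl[[pull]], tbl[["doc_id"]])
  text <- vapply(
    groups,
    function(tokens) paste(make_ngrams(tokens, n, sep), collapse = .collapse),
    character(1)
  )
  data.frame(doc_id = names(text), text = unname(text), stringsAsFactors = FALSE)
}

unigrams <- pack_sketch(tbl)                      # one text per document
bigrams  <- pack_sketch(tbl, n = 2L, sep = "-")   # n-grams joined by `sep`
```

In this sketch, `sep` joins the tokens inside each n-gram, while `.collapse` joins the resulting n-grams into a single text per `doc_id`.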
Text Interchange Formats (TIF)
The Text Interchange Formats (TIF) is a set of standards that allows R text analysis packages to target defined inputs and outputs for corpora, tokens, and document-term matrices.
Valid data.frame of tokens
The prettified data.frame of tokens here is a data.frame object compatible with the TIF.
A TIF-valid data.frame of tokens is expected to have one unique key column (named `doc_id`) identifying each text and several feature columns for each token. The feature columns must include at least `token` itself.
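As a rough illustration of the shape described above, the following base R sketch builds a small data.frame of tokens and checks it for the two required columns. The checker `is_tokens_like()` is a hypothetical helper written for this example, and the `pos` column is made-up sample data; neither comes from the package or from the TIF specification.

```r
# A small tokens table: one row per token, keyed by `doc_id`.
tokens <- data.frame(
  doc_id = c("doc1", "doc1", "doc2"),
  token  = c("hello", "world", "goodbye"),
  pos    = c("INTJ", "NOUN", "INTJ"),  # optional extra feature column (sample data)
  stringsAsFactors = FALSE
)

# Hypothetical checker: only tests the requirements stated on this page,
# namely a `doc_id` key column and a `token` feature column.
is_tokens_like <- function(x) {
  is.data.frame(x) &&
    all(c("doc_id", "token") %in% names(x)) &&
    is.character(x[["token"]])
}

ok  <- is_tokens_like(tokens)                         # has both columns
bad <- is_tokens_like(tokens[, "pos", drop = FALSE])  # lacks `doc_id` and `token`
```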