Pack prettified data.frame of tokens

Packs a prettified data.frame of tokens into a new data.frame of corpus, which is compatible with the Text Interchange Formats.

Usage

pack(tbl, pull = "token", n = 1L, sep = "-", .collapse = " ")

Arguments

tbl: A prettified data.frame of tokens.
pull: Column to be packed into text or ngrams body. Default value is token.
n: Integer internally passed to ngrams tokenizer function created of audubon::ngram_tokenizer()
sep: Character scalar internally used as the concatenator of ngrams.
.collapse: This argument is passed to stringi::stri_join().

Value

A data.frame.

Text Interchange Formats (TIF)

The Text Interchange Formats (TIF) is a set of standards that allows R text analysis packages to target defined inputs and outputs for corpora, tokens, and document-term matrices.

Valid data.frame of tokens

The prettified data.frame of tokens here is a data.frame object compatible with the TIF.

A TIF valid data.frame of tokens are expected to have one unique key column (named doc_id) of each text and several feature columns of each tokens. The feature columns must contain at least token itself.