Pack a data.frame of tokens

Packs a data.frame of tokens into a new data.frame of corpus that is compatible with the Text Interchange Formats (TIF).

Usage

pack(tbl, pull = "token", n = 1L, sep = "-", .collapse = " ")

Arguments

tbl: A data.frame of tokens.
pull: <data-masked> Column to be packed into the text or n-gram body. The default is token.
n: Integer passed internally to the n-gram tokenizer function created by tangela::ngram_tokenizer().
sep: Character scalar internally used as the concatenator of ngrams.
.collapse: This argument is passed to stringi::stri_c().
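
As a rough illustration of how n and sep relate, the base R sketch below joins each run of n consecutive tokens with sep. The helper make_ngrams is hypothetical and written only for this example; it is not the tokenizer created by tangela::ngram_tokenizer(), whose internals may differ.

```r
# Hypothetical helper (not part of any package): builds n-grams by joining
# each window of `n` consecutive tokens with `sep`.
make_ngrams <- function(x, n = 2L, sep = "-") {
  # Fewer tokens than the window size yields no n-grams.
  if (length(x) < n) {
    return(character(0))
  }
  starts <- seq_len(length(x) - n + 1L)
  vapply(
    starts,
    function(i) paste(x[i:(i + n - 1L)], collapse = sep),
    character(1)
  )
}

bigrams <- make_ngrams(c("the", "quick", "fox"), n = 2L, sep = "-")
bigrams
```

With n = 1L each token forms its own n-gram, so sep has no visible effect.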

Value

A tibble.
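
To give a feel for the shape of the result with the defaults n = 1L and .collapse = " ", the base R sketch below approximates the packing step. It is not the package's implementation: it returns a plain data.frame rather than a tibble, and the output column name text is an assumption taken from the TIF corpus convention (doc_id plus text).

```r
# Example input: one row per token, keyed by doc_id.
tokens <- data.frame(
  doc_id = c("doc1", "doc1", "doc1", "doc2", "doc2"),
  token = c("the", "quick", "fox", "hello", "world"),
  stringsAsFactors = FALSE
)

# Collapse the tokens of each document into a single string, keeping the
# row order of the tokens within each document.
text <- tapply(tokens$token, tokens$doc_id, paste, collapse = " ")

# Assemble a corpus-like data.frame: one row per document.
packed <- data.frame(
  doc_id = names(text),
  text = as.vector(text),
  stringsAsFactors = FALSE
)
packed
```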

Text Interchange Formats (TIF)

The Text Interchange Formats (TIF) is a set of standards that allows R text analysis packages to target defined inputs and outputs for corpora, tokens, and document-term matrices.

Valid data.frame of tokens

The data.frame of tokens here is a data.frame object compatible with the TIF.

A TIF-valid data.frame of tokens is expected to have one unique key column (named doc_id) identifying each text and several feature columns describing each token. The feature columns must contain at least the token itself.
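
A minimal sketch of such a data.frame, built with base R, is shown here. The helper is_tokens_df is hypothetical and checks only that the two columns named above are present; it is not a full TIF validator.

```r
# One row per token; doc_id identifies the text each token belongs to.
tokens <- data.frame(
  doc_id = c("doc1", "doc1", "doc1", "doc2", "doc2"),
  token = c("the", "quick", "fox", "hello", "world"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of any package): TRUE when `tbl` is a
# data.frame that has both a doc_id column and a token column.
is_tokens_df <- function(tbl) {
  is.data.frame(tbl) && all(c("doc_id", "token") %in% names(tbl))
}

is_tokens_df(tokens)
```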
