Packs a data.frame of tokens into a new corpus data.frame that is compatible with the Text Interchange Formats (TIF).

Usage

pack(tbl, pull = "token", n = 1L, sep = "-", .collapse = " ")

Arguments

tbl

A data.frame of tokens.

pull

<data-masked> Column to be packed into the text or n-gram body. Defaults to token.

n

Integer passed internally to the n-gram tokenizer function created by audubon::ngram_tokenizer().

sep

Character scalar used internally to concatenate the tokens within each n-gram.

.collapse

This argument is passed to stringi::stri_c().

Value

A tibble.

Text Interchange Formats (TIF)

The Text Interchange Formats (TIF) is a set of standards that allows R text analysis packages to target defined inputs and outputs for corpora, tokens, and document-term matrices.

Valid data.frame of tokens

The data.frame of tokens here is a data.frame object compatible with the TIF.

A TIF-valid data.frame of tokens is expected to have one unique key column (named doc_id) identifying each text, plus several feature columns describing each token. The feature columns must contain at least the token itself.
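As a rough illustration of the layout described above, the following base R sketch builds a small data.frame of tokens by hand and checks that it has the two columns named in the text. The token values are taken from the first two documents in the example output on this page; the checks are only an assumption about what "valid" minimally means here, not the validation the package itself performs.

```r
# Toy data.frame of tokens: one row per token, keyed by doc_id.
tokens <- data.frame(
  doc_id = c("1", "1", "1", "2", "2"),
  token = c("ポラーノ", "の", "広場", "宮沢", "賢治"),
  stringsAsFactors = FALSE
)

# Minimal structural checks (an assumption for illustration only):
# the key column and the token column must both be present.
has_required_columns <- all(c("doc_id", "token") %in% names(tokens))
stopifnot(is.data.frame(tokens), has_required_columns)
```

Further feature columns (for example part-of-speech tags) could sit alongside token without affecting these checks.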

Examples

pack(strj_tokenize(polano[1:5], format = "data.frame"))
#> # A tibble: 5 × 2
#>   doc_id text                                                                   
#>   <fct>  <chr>                                                                  
#> 1 1      ポラーノ の 広場                                                       
#> 2 2      宮沢 賢治                                                              
#> 3 3      前 十七 等 官 レ オー ノ ・ キュー スト 誌                             
#> 4 4      宮沢 賢治 訳述                                                         
#> 5 5      その ころ わたくし は 、 モリーオ 市 の 博物 局 に 勤め て 居 り ま し…
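
To make the roles of n, sep, and .collapse more concrete, here is a base R sketch of the general idea: tokens are grouped by doc_id, each run of n consecutive tokens is joined with sep, and the resulting pieces are joined with .collapse. The helpers sketch_ngrams() and sketch_pack() are hypothetical names introduced only for this illustration; they are not part of the package, and the real pack() relies on audubon::ngram_tokenizer() and stringi::stri_c(), whose handling of edge cases (such as documents shorter than n) may differ.

```r
# Hypothetical helper: join every run of n consecutive tokens with sep.
sketch_ngrams <- function(x, n = 1L, sep = "-") {
  if (length(x) < n) {
    return(character(0))
  }
  starts <- seq_len(length(x) - n + 1L)
  vapply(
    starts,
    function(i) paste(x[i:(i + n - 1L)], collapse = sep),
    character(1)
  )
}

# Hypothetical helper: one row per doc_id, n-grams joined with .collapse.
sketch_pack <- function(tbl, n = 1L, sep = "-", .collapse = " ") {
  pieces <- split(tbl$token, tbl$doc_id)
  text <- vapply(
    pieces,
    function(x) paste(sketch_ngrams(x, n, sep), collapse = .collapse),
    character(1)
  )
  data.frame(doc_id = names(text), text = unname(text), stringsAsFactors = FALSE)
}

# Toy input with placeholder tokens.
toy <- data.frame(
  doc_id = c("1", "1", "1", "2", "2"),
  token = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

unigrams <- sketch_pack(toy)          # text: "a b c", "d e"
bigrams <- sketch_pack(toy, n = 2L)   # text: "a-b b-c", "d-e"
```

With n = 1 the sep argument has no visible effect, since each piece is a single token; it only matters once n is 2 or more.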