Tokenizes Japanese character strings using a selectable segmentation engine and returns the result as a list or a data frame.

This function provides a unified interface to multiple Japanese text segmentation backends. External command-based engines were removed in v0.6.0, and all tokenization is performed using in-process implementations.

strj_segment() and strj_tinyseg() are aliases for strj_tokenize() with the "budoux" and "tinyseg" engines, respectively.
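As a minimal sketch of that alias relationship (the input string is an arbitrary illustration), both comparisons below should return TRUE:

x <- "\u65e5\u672c\u8a9e\u306e\u6587\u7ae0"  # "日本語の文章", an arbitrary example string
identical(strj_segment(x), strj_tokenize(x, engine = "budoux"))
identical(strj_tinyseg(x), strj_tokenize(x, engine = "tinyseg"))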

Usage

strj_tokenize(
  text,
  format = c("list", "data.frame"),
  engine = c("stringi", "budoux", "tinyseg"),
  split = FALSE,
  ...
)

strj_segment(text, format = c("list", "data.frame"), split = FALSE)

strj_tinyseg(text, format = c("list", "data.frame"), split = FALSE)

Arguments

text

A character vector of Japanese text to tokenize.

format

A string specifying the output format; either "list" (the default) or "data.frame".

engine

A string specifying the tokenization engine; one of "stringi" (the default), "budoux", or "tinyseg".

split

A logical value indicating whether the text should be split into individual sentences before tokenization. Defaults to FALSE.

...

Additional arguments passed to the underlying engine.
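For instance, a sketch of sentence-wise tokenization with split (the input is illustrative, and sentence boundary detection is engine-dependent, so no output is shown):

txt <- paste0(
  "\u3053\u308c\u306f\u6587\u3067\u3059\u3002",  # "これは文です。"
  "\u3042\u308c\u3082\u6587\u3067\u3059\u3002"   # "あれも文です。"
)
strj_tokenize(txt, split = TRUE)  # splits into sentences before tokenizing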

Value

If format = "list", a named list of character vectors, one per input element. If format = "data.frame", a data frame with one row per token and columns doc_id (document identifier) and token.

Details

The following engines are supported:

  • "stringi": Uses ICU-based boundary analysis via stringi.

  • "budoux": Uses a rule-based Japanese phrase segmentation algorithm.

  • "tinyseg": Uses a TinySegmenter-compatible statistical model.

The legacy "mecab" and "sudachipy" engines were removed in v0.6.0.
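As a sketch, an engine is selected per call; token boundaries differ between engines, so no output is shown here:

txt <- "\u3059\u304d\u3068\u304a\u3063\u305f\u98a8"  # "すきとおった風"
strj_tokenize(txt, engine = "stringi")  # ICU boundary analysis (default)
strj_tokenize(txt, engine = "budoux")   # rule-based phrase segmentation
strj_tokenize(txt, engine = "tinyseg")  # TinySegmenter-compatible model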

Examples

strj_tokenize(
  paste0(
    "\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
    "\u30fc\u30f4\u30a9\u306e\u3059\u304d",
    "\u3068\u304a\u3063\u305f\u98a8"
  )
)
#> $`1`
#> [1] "あの"           "イーハトーヴォ" "の"             "すき"          
#> [5] "と"             "おっ"           "た"             "風"            
#> 
strj_tokenize(
  paste0(
    "\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
    "\u30fc\u30f4\u30a9\u306e\u3059\u304d",
    "\u3068\u304a\u3063\u305f\u98a8"
  ),
  format = "data.frame"
)
#>   doc_id          token
#> 1      1           あの
#> 2      1 イーハトーヴォ
#> 3      1             の
#> 4      1           すき
#> 5      1             と
#> 6      1           おっ
#> 7      1             た
#> 8      1             風