
Splits text into tokens using the specified tokenizer.

Usage

strj_tokenize(
  text,
  format = c("list", "data.frame"),
  engine = c("stringi", "budoux", "tinyseg", "mecab", "sudachipy"),
  rcpath = NULL,
  mode = c("C", "B", "A"),
  split = FALSE
)

Arguments

text

Character vector to be tokenized.

format

Output format. Either 'list' or 'data.frame'.

engine

Tokenizer name. Choose one of 'stringi', 'budoux', 'tinyseg', 'mecab', or 'sudachipy'. Note that the specified tokenizer must already be installed and available when you use 'mecab' or 'sudachipy' (a sketch of such a call appears at the end of the examples).

rcpath

Optional path to a configuration file for 'MeCab' or 'sudachipy'.

mode

Splitting mode for 'sudachipy'; one of 'C', 'B', or 'A'.

split

Logical. If TRUE, the function splits the vector into sentences using stringi::stri_split_boundaries(type = "sentence") before tokenizing (see the additional example below).

Value

A list or a data.frame, depending on the format argument.

Examples

# Tokenize with the default engine, 'stringi' (no extra installation
# needed); the result is a list named by document id.
strj_tokenize(
  paste0(
    "\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
    "\u30fc\u30f4\u30a9\u306e\u3059\u304d",
    "\u3068\u304a\u3063\u305f\u98a8"
  )
)
#> $`1`
#> [1] "あの"           "イーハトーヴォ" "の"             "すき"          
#> [5] "と"             "おっ"           "た"             "風"            
#> 
# The same text, returned as a data.frame with doc_id and token columns.
strj_tokenize(
  paste0(
    "\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
    "\u30fc\u30f4\u30a9\u306e\u3059\u304d",
    "\u3068\u304a\u3063\u305f\u98a8"
  ),
  format = "data.frame"
)
#>   doc_id          token
#> 1      1           あの
#> 2      1 イーハトーヴォ
#> 3      1             の
#> 4      1           すき
#> 5      1             と
#> 6      1           おっ
#> 7      1             た
#> 8      1             風
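
# The same call with split = TRUE, which applies
# stringi::stri_split_boundaries(type = "sentence") before tokenizing.
# With this single-sentence input the result should match the output
# above; multi-sentence input would yield one document per sentence.
strj_tokenize(
  paste0(
    "\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
    "\u30fc\u30f4\u30a9\u306e\u3059\u304d",
    "\u3068\u304a\u3063\u305f\u98a8"
  ),
  format = "data.frame",
  split = TRUE
)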
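
if (FALSE) {
  # A sketch only, not run: the 'sudachipy' engine requires a working
  # 'sudachipy' installation (see the engine argument above). mode
  # selects the splitting unit, and rcpath, if given, points to a
  # configuration file.
  strj_tokenize(
    "\u3042\u306e\u30a4\u30fc\u30cf\u30c8\u30fc\u30f4\u30a9",
    engine = "sudachipy",
    mode = "A"
  )
}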