An alias of strj_tokenize(engine = "tinyseg").

Usage

strj_tinyseg(text, format = c("list", "data.frame"), split = FALSE)

Arguments

text

Character vector to be tokenized.

format

Output format. Either "list" or "data.frame".

split

Logical. If TRUE, the function splits the text into sentences using stringi::stri_split_boundaries(type = "sentence") before tokenizing.

Value

A list of character vectors or a data.frame, depending on format.

Examples

strj_tinyseg(
  paste0(
    "\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
    "\u30fc\u30f4\u30a9\u306e\u3059\u304d",
    "\u3068\u304a\u3063\u305f\u98a8"
  )
)
#> $`1`
#> [1] "あ"             "の"             "イーハトーヴォ" "の"            
#> [5] "すき"           "と"             "おっ"           "た"            
#> [9] "風"            
#> 
strj_tinyseg(
  paste0(
    "\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
    "\u30fc\u30f4\u30a9\u306e\u3059\u304d",
    "\u3068\u304a\u3063\u305f\u98a8"
  ),
  format = "data.frame"
)
#>   doc_id          token
#> 1      1             あ
#> 2      1             の
#> 3      1 イーハトーヴォ
#> 4      1             の
#> 5      1           すき
#> 6      1             と
#> 7      1           おっ
#> 8      1             た
#> 9      1             風
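The split argument described under Arguments is not shown above. The sketch below assumes the audubon package is installed and passes two sentences (the added second sentence is illustrative text, not from the original example); exact output depends on the TinySeg engine, so none is shown.

```r
library(audubon)

res <- strj_tinyseg(
  paste0(
    "\u3042\u306e\u30a4\u30fc\u30cf\u30c8\u30fc\u30f4\u30a9",
    "\u306e\u3059\u304d\u3068\u304a\u3063\u305f\u98a8\u3002",
    "\u590f\u3067\u3082\u5e95\u306b\u51b7\u305f\u3055\u3092",
    "\u3082\u3064\u9752\u3044\u305d\u3089\u3002"
  ),
  format = "data.frame",
  split = TRUE
)

# With split = TRUE, sentence boundaries are detected first,
# so each sentence is tokenized separately.
head(res)
```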