An alias of strj_tokenize(engine = "tinyseg").
Usage
strj_tinyseg(text, format = c("list", "data.frame"), split = FALSE)
Arguments
- text
Character vector to be tokenized.
- format
Output format. Choose list or data.frame.
- split
Logical. If TRUE, the function splits the input vectors into sentences using stringi::stri_split_boundaries(type = "sentence") before tokenizing.
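For instance, a multi-sentence input can be pre-split with split = TRUE so that each sentence is tokenized as its own document. This is a sketch only: the two-sentence input string below is an assumption, and the output is omitted because the tokens returned depend on the TinySegmenter model.

```r
# Sketch: a string containing two sentences (assumed example input).
# With split = TRUE, the string is first split on sentence boundaries
# via stringi::stri_split_boundaries(type = "sentence"), then each
# sentence is tokenized and appears under its own doc_id.
strj_tinyseg(
  paste0(
    "\u3042\u306e\u30a4\u30fc\u30cf\u30c8\u30fc\u30f4\u30a9\u306e",
    "\u3059\u304d\u3068\u304a\u3063\u305f\u98a8\u3002",
    "\u590f\u3067\u3082\u5e95\u306b\u51b7\u305f\u3055\u3092",
    "\u3082\u3064\u9752\u3044\u305d\u3089\u3002"
  ),
  format = "data.frame",
  split = TRUE
)
```

With format = "data.frame", the doc_id column then distinguishes the tokens of the first sentence from those of the second.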
Examples
strj_tinyseg(
paste0(
"\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
"\u30fc\u30f4\u30a9\u306e\u3059\u304d",
"\u3068\u304a\u3063\u305f\u98a8"
)
)
#> $`1`
#> [1] "あ" "の" "イーハトーヴォ" "の"
#> [5] "すき" "と" "おっ" "た"
#> [9] "風"
#>
strj_tinyseg(
paste0(
"\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
"\u30fc\u30f4\u30a9\u306e\u3059\u304d",
"\u3068\u304a\u3063\u305f\u98a8"
),
format = "data.frame"
)
#> doc_id token
#> 1 1 あ
#> 2 1 の
#> 3 1 イーハトーヴォ
#> 4 1 の
#> 5 1 すき
#> 6 1 と
#> 7 1 おっ
#> 8 1 た
#> 9 1 風