Splits text into tokens using the specified tokenizer.
Arguments
- text
Character vector to be tokenized.
- format
Output format. Choose one of 'list' or 'data.frame'.
- engine
Tokenizer name. Choose one of 'stringi', 'budoux', 'tinyseg', 'mecab', or 'sudachipy'. Note that the specified tokenizer must be installed and available when you use 'mecab' or 'sudachipy'.
- rcpath
Path to a settings file for 'MeCab' or 'sudachipy', if any.
- mode
Splitting mode for 'sudachipy'.
- split
Logical. If TRUE, the function splits the vector into sentences using stringi::stri_split_boundaries(type = "sentence") before tokenizing.
Examples
strj_tokenize(
paste0(
"\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
"\u30fc\u30f4\u30a9\u306e\u3059\u304d",
"\u3068\u304a\u3063\u305f\u98a8"
)
)
#> $`1`
#> [1] "あの" "イーハトーヴォ" "の" "すき"
#> [5] "と" "おっ" "た" "風"
#>
strj_tokenize(
paste0(
"\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
"\u30fc\u30f4\u30a9\u306e\u3059\u304d",
"\u3068\u304a\u3063\u305f\u98a8"
),
format = "data.frame"
)
#> doc_id token
#> 1 1 あの
#> 2 1 イーハトーヴォ
#> 3 1 の
#> 4 1 すき
#> 5 1 と
#> 6 1 おっ
#> 7 1 た
#> 8 1 風
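A further sketch of the split argument, reusing the same input text: with split = TRUE the vector is first segmented into sentences via stringi::stri_split_boundaries(type = "sentence") and each sentence is then tokenized. The exact tokens returned depend on the engine in use, so no output is shown here.

```r
strj_tokenize(
  paste0(
    "\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
    "\u30fc\u30f4\u30a9\u306e\u3059\u304d",
    "\u3068\u304a\u3063\u305f\u98a8"
  ),
  format = "data.frame",
  split = TRUE # segment into sentences before tokenizing
)
```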