Tokenize sentences using 'vibrato'
Usage
tokenize(
  x,
  text_field = "text",
  docid_field = "doc_id",
  sys_dic = "",
  user_dic = "",
  split = FALSE,
  mode = c("parse", "wakati"),
  max_grouping_len = 0L,
  verbose = FALSE
)
Arguments
- x
A data.frame-like object or a character vector to be tokenized.
- text_field
<data-masked> String or symbol; column containing texts to be tokenized.
- docid_field
<data-masked> String or symbol; column containing document IDs.
- sys_dic
Character scalar; path to the system dictionary for 'vibrato'.
- user_dic
Character scalar; path to the user dictionary for 'vibrato'.
- split
Logical. If TRUE, the function internally splits the sentences into sub-sentences.
- mode
Character scalar to switch the output format; either "parse" or "wakati".
- max_grouping_len
Integer scalar; the maximum grouping length for unknown words. The default value is 0L, which means no limit on the length.
- verbose
Logical. If TRUE, returns additional information for debugging.
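Examples
The sketch below shows how the arguments above might fit together. It assumes that the package providing tokenize() is already attached and that a compiled 'vibrato' system dictionary is available locally. The path "path/to/system.dic" and the sample documents are hypothetical placeholders, so the calls are guarded and only run when that file exists.

```r
# Hypothetical input: a data.frame with a document ID column and a text column.
docs <- data.frame(
  doc_id = c("d1", "d2"),
  text = c("すもももももももものうち", "今日は良い天気です"),
  stringsAsFactors = FALSE
)

# Hypothetical path; replace it with the location of a real 'vibrato' system dictionary.
dic_path <- "path/to/system.dic"

# Run only when the dictionary file exists and tokenize() is available.
if (file.exists(dic_path) && exists("tokenize", mode = "function")) {
  # Columns named as strings, using the default "parse" output format.
  parsed <- tokenize(
    docs,
    text_field = "text",
    docid_field = "doc_id",
    sys_dic = dic_path,
    mode = "parse"
  )

  # The same columns named as symbols, switching to the "wakati" output format
  # and splitting each text into sub-sentences first.
  wakati <- tokenize(
    docs,
    text_field = text,
    docid_field = doc_id,
    sys_dic = dic_path,
    split = TRUE,
    mode = "wakati"
  )
}
```

Passing the dictionary path explicitly keeps the example independent of any session-level configuration; user_dic can be supplied the same way when a user dictionary is needed.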