Tokenize sentences using 'Vibrato'
Usage
tokenize(
  x,
  text_field = "text",
  docid_field = "doc_id",
  sys_dic = "",
  user_dic = "",
  split = FALSE,
  mode = c("parse", "wakati")
)
Arguments
- x
A data.frame-like object or a character vector to be tokenized.
- text_field
<data-masked> String or symbol; column containing texts to be tokenized.
- docid_field
<data-masked> String or symbol; column containing document IDs.
- sys_dic
Character scalar; path to the system dictionary for 'Vibrato'.
- user_dic
Character scalar; path to the user dictionary for 'Vibrato'.
- split
Logical. When passed as TRUE, the function internally splits the sentences into sub-sentences.
- mode
Character scalar to switch output format; either "parse" or "wakati".
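Examples
The following is a minimal sketch of a call that uses the arguments described above, not an official example. The dictionary path `dic_path` and the sample data frame `df` are hypothetical placeholders. The call is guarded so that the snippet does nothing when no `tokenize()` function is attached or the dictionary file is missing. The page does not describe the return value, so the result is only inspected with `str()`.

```r
# Hypothetical path: replace with the location of your own Vibrato system dictionary.
dic_path <- "path/to/system.dic"

# Sample input: one row per document, using the default column names
# expected by `text_field` and `docid_field`.
df <- data.frame(
  doc_id = c("d1", "d2"),
  text = c("こんにちは。今日は良い天気です。", "すもももももももものうち"),
  stringsAsFactors = FALSE
)

res <- NULL

# Run only when a `tokenize()` function is available (assumed here to be the one
# documented on this page) and the dictionary file actually exists.
if (exists("tokenize", mode = "function") && file.exists(dic_path)) {
  res <- tokenize(
    df,
    text_field = "text",
    docid_field = "doc_id",
    sys_dic = dic_path,
    split = FALSE,
    mode = "parse"
  )
  str(res)
}
```

Passing `mode = "wakati"` instead selects the other output format listed in the usage signature.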