Tokenize sentences using 'vibrato'

Usage

tokenize(
  x,
  text_field = "text",
  docid_field = "doc_id",
  sys_dic = "",
  user_dic = "",
  split = FALSE,
  mode = c("parse", "wakati"),
  max_grouping_len = 0L,
  verbose = FALSE
)

Arguments

x

A data.frame-like object or a character vector to be tokenized.

text_field

<data-masked> String or symbol; column containing texts to be tokenized.

docid_field

<data-masked> String or symbol; column containing document IDs.

sys_dic

Character scalar; path to the system dictionary for 'vibrato'.

user_dic

Character scalar; path to the user dictionary for 'vibrato'.

split

Logical. If TRUE, the function internally splits each sentence into sub-sentences.

mode

Character scalar to switch the output format; either "parse" or "wakati".

max_grouping_len

Integer scalar; the maximum grouping length for unknown words. The default value, 0L, means no limit (infinite length).

verbose

Logical. If TRUE, returns additional information for debugging.

Value

A tibble or a named list of tokens.
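
As a rough illustration of how the arguments above fit together, the following sketch tokenizes a small data frame. It assumes that the package exporting tokenize() is already attached and that a 'vibrato' system dictionary is available locally; the path in dic_path and the sample texts are hypothetical placeholders, not values shipped with the package. The call is skipped when no dictionary is found at that path.

```r
# Hypothetical path: replace with the location of a real 'vibrato'
# system dictionary on your machine.
dic_path <- "path/to/system.dic"

# Sample input using the default column names from the Usage section.
docs <- data.frame(
  doc_id = c("d1", "d2"),
  text = c("すもももももももものうち", "吾輩は猫である"),
  stringsAsFactors = FALSE
)

# Run only when the dictionary exists; tokenize() is assumed to be attached.
if (file.exists(dic_path)) {
  # mode = "parse" is the first (default) choice in the Usage section.
  parsed <- tokenize(
    docs,
    text_field = "text",
    docid_field = "doc_id",
    sys_dic = dic_path,
    mode = "parse"
  )
  print(parsed)

  # mode = "wakati" switches the output format.
  wakati <- tokenize(docs, sys_dic = dic_path, mode = "wakati")
  str(wakati)
}
```

Passing mode explicitly, rather than relying on the default, keeps the intended output format visible at the call site.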