Tokenize sentences using 'Vibrato'

Usage

tokenize(
  x,
  text_field = "text",
  docid_field = "doc_id",
  sys_dic = "",
  user_dic = "",
  split = FALSE,
  mode = c("parse", "wakati")
)

Arguments

x

A data.frame-like object or a character vector to be tokenized.

text_field

<data-masked> String or symbol; column containing texts to be tokenized.

docid_field

<data-masked> String or symbol; column containing document IDs.

sys_dic

Character scalar; path to the system dictionary for 'Vibrato'.

user_dic

Character scalar; path to the user dictionary for 'Vibrato'.

split

Logical. When passed as TRUE, the function internally splits the sentences into sub-sentences.

mode

Character scalar to switch output format; one of "parse" or "wakati".

Value

A tibble or a named list of tokens.
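
Examples

The following is a minimal sketch of how the arguments above might be combined, not an official example. It assumes that the package providing tokenize() is already attached and that a 'Vibrato' system dictionary is available locally. The data frame `docs`, its contents, and the path in `dic_path` are hypothetical placeholders; the calls are skipped when the function or the dictionary file is not found.

```r
# Hypothetical input: two short Japanese documents in a data.frame.
docs <- data.frame(
  doc_id = c("d1", "d2"),
  text = c("すもももももももものうち", "今日は良い天気です"),
  stringsAsFactors = FALSE
)

# Hypothetical path; replace it with the location of a real
# 'Vibrato' system dictionary on your machine.
dic_path <- "path/to/system.dic"

# Run only when tokenize() is available and the dictionary file exists.
if (exists("tokenize", mode = "function") && file.exists(dic_path)) {
  # mode = "parse": request the first output format listed in Usage.
  parsed <- tokenize(
    docs,
    text_field = "text",
    docid_field = "doc_id",
    sys_dic = dic_path,
    mode = "parse"
  )
  print(parsed)

  # mode = "wakati": request the alternative output format.
  wakati <- tokenize(
    docs,
    text_field = "text",
    docid_field = "doc_id",
    sys_dic = dic_path,
    mode = "wakati"
  )
  print(wakati)
}
```

Here text_field and docid_field are passed as strings, which the Arguments section permits; user_dic and split are left at their defaults.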