
An R wrapper for ‘Vibrato’: Viterbi-based accelerated tokenizer

Installation

Installing from the source package requires the Rust toolchain.

install.packages("vibrrt", repos = c("https://paithiov909.r-universe.dev", "https://cloud.r-project.org"))
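Before attempting a source install, it can help to confirm that the Rust toolchain is visible to R. The following is a minimal sketch using only base R; it assumes the toolchain is exposed through the `cargo` and `rustc` executables on the `PATH`, and the variable names are illustrative.

```r
# Look up the Rust executables on the PATH; Sys.which() returns "" when not found.
rust_tools <- Sys.which(c("cargo", "rustc"))

# TRUE only if both executables were located.
has_rust <- all(nzchar(rust_tools))

if (has_rust) {
  message("Rust toolchain found: ", paste(rust_tools, collapse = ", "))
} else {
  message("Rust toolchain not found; a source install is likely to fail.")
}
```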

Usage

You can download the model files from ryan-minato/vibrato-models using the hfhub package.

The functions follow the same design as those in the gibasa package. See the README of the gibasa package for more detailed usage.

sample_text <- jsonlite::read_json(
  "https://paithiov909.r-universe.dev/gibasa/data/ginga/json",
  simplifyVector = TRUE
)

withr::with_envvar(c(HUGGINGFACE_HUB_CACHE = tempdir()), {
  ipadic <- hfhub::hub_download("ryan-minato/vibrato-models", "ipadic-mecab-2_7_0/system.dic")
})
#> ipadic-mecab-2_7_0/system.dic ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ |  48 MB/ 48 MB E…

vibrrt::tokenize(sample_text[5:8], sys_dic = ipadic)
#> # A tibble: 187 × 7
#>    doc_id sentence_id token_id token        feature         word_cost total_cost
#>    <fct>        <dbl>    <dbl> <chr>        <chr>               <int>      <int>
#>  1 1                1        1             記号,空白,*,*,*,*,…      1287        993
#>  2 1                1        2 カムパネルラ 名詞,一般,*,*,*,*,…      9461      10379
#>  3 1                1        3 が           助詞,格助詞,一般,*,*,…      3866       9524
#>  4 1                1        4 手           名詞,一般,*,*,*,*,…      5631      14331
#>  5 1                1        5 を           助詞,格助詞,一般,*,*,…      4183      13521
#>  6 1                1        6 あげ         動詞,自立,*,*,一段,連…      9908      20097
#>  7 1                1        7 まし         助動詞,*,*,*,特殊・マ…      6320      17966
#>  8 1                1        8 た           助動詞,*,*,*,特殊・タ…      5500      17369
#>  9 1                1        9 。           記号,句点,*,*,*,*,…       215      13935
#> 10 1                1       10 それ         名詞,代名詞,一般,*,*,…      4818      18710
#> # ℹ 177 more rows
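The `feature` column holds the dictionary features as one comma-separated string, so a common next step is to pull out individual fields. The sketch below uses base R only and a hand-built stand-in for the result above: the `toy` data frame and its feature strings are illustrative, cut down to the fields visible in the printed output, and it assumes the first field is the top-level part of speech, as in IPADIC-style dictionaries.

```r
# Toy stand-in for the tokenizer output (NOT real output of vibrrt::tokenize);
# feature strings contain only the fields visible in the printed tibble.
toy <- data.frame(
  token = c("カムパネルラ", "が", "手", "を"),
  feature = c(
    "名詞,一般,*,*,*,*",
    "助詞,格助詞,一般,*,*",
    "名詞,一般,*,*,*,*",
    "助詞,格助詞,一般,*,*"
  ),
  stringsAsFactors = FALSE
)

# Split each feature string on commas and keep the first field
# (assumed here to be the top-level part of speech).
fields <- strsplit(toy$feature, ",", fixed = TRUE)
toy$pos1 <- vapply(fields, `[`, character(1), 1)

# Keep only the tokens tagged as nouns (名詞).
nouns <- toy$token[toy$pos1 == "名詞"]
print(nouns)
```

The same idea applies to the real tibble by substituting it for `toy`. The gibasa README describes the tooling that package provides for parsing features.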