
An R wrapper for ‘vibrato’: Viterbi-based accelerated tokenizer

Installation

Installing from the source package requires the Rust toolchain.

install.packages("vibrrt", repos = c("https://paithiov909.r-universe.dev", "https://cloud.r-project.org"))

Usage

You can download the model files from ryan-minato/vibrato-models using the hfhub package.

Functions are designed in the same fashion as in the gibasa package. Check the README of the gibasa package for more detailed usage.

sample_text <- jsonlite::read_json(
  "https://paithiov909.r-universe.dev/gibasa/data/ginga/json",
  simplifyVector = TRUE
)

# Optional: uncomment the withr lines to keep the download in a temporary cache directory.
# withr::with_envvar(c(HUGGINGFACE_HUB_CACHE = tempdir()), {
ipadic <- hfhub::hub_download("ryan-minato/vibrato-models", "ipadic-mecab-2_7_0/system.dic")
# })

vibrrt::tokenize(sample_text[5:8], sys_dic = ipadic)
#> # A tibble: 187 × 5
#>    doc_id sentence_id token_id token        feature                             
#>    <fct>        <dbl>    <dbl> <chr>        <chr>                               
#>  1 1                1        1             記号,空白,*,*,*,*, , ,           
#>  2 1                1        2 カムパネルラ 名詞,一般,*,*,*,*,*                 
#>  3 1                1        3 が           助詞,格助詞,一般,*,*,*,が,ガ,ガ     
#>  4 1                1        4 手           名詞,一般,*,*,*,*,手,テ,テ          
#>  5 1                1        5 を           助詞,格助詞,一般,*,*,*,を,ヲ,ヲ     
#>  6 1                1        6 あげ         動詞,自立,*,*,一段,連用形,あげる,アゲ,アゲ……
#>  7 1                1        7 まし         助動詞,*,*,*,特殊・マス,連用形,ます,マシ,マシ……
#>  8 1                1        8 た           助動詞,*,*,*,特殊・タ,基本形,た,タ,タ……
#>  9 1                1        9 。           記号,句点,*,*,*,*,。,。,。          
#> 10 1                1       10 それ         名詞,代名詞,一般,*,*,*,それ,ソレ,ソレ……
#> # ℹ 177 more rows
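
The feature column holds the dictionary's comma-separated fields, with the part of speech first for IPADIC. As a minimal sketch using base R only, the snippet below splits that column and filters by part of speech. The rows are copied by hand from the output above rather than produced by calling vibrrt, and the `pos` column name is an assumption chosen for illustration, not something the package defines.

```r
# Toy rows copied from the printed output above (not generated by vibrrt here).
tokens <- data.frame(
  token = c("が", "手", "を"),
  feature = c(
    "助詞,格助詞,一般,*,*,*,が,ガ,ガ",
    "名詞,一般,*,*,*,*,手,テ,テ",
    "助詞,格助詞,一般,*,*,*,を,ヲ,ヲ"
  ),
  stringsAsFactors = FALSE
)

# Split each feature string on literal commas.
fields <- strsplit(tokens$feature, ",", fixed = TRUE)

# Keep the first field (part of speech) in a hypothetical `pos` column.
tokens$pos <- vapply(fields, function(x) x[1], character(1))

# Keep only the rows tagged as nouns.
nouns <- tokens[tokens$pos == "名詞", ]
print(nouns$token)
```

The same approach should carry over to the full tibble returned by the tokenizer, since it relies only on the `feature` column being a character vector.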