An R wrapper for vibrato: Viterbi-based accelerated tokenizer.
Installation
To install from source package, the Rust toolchain is required.
install.packages("vibrrt", repos = c("https://paithiov909.r-universe.dev", "https://cloud.r-project.org"))
Usage
You can download the model files from ryan-minato/vibrato-models using hfhub package.
sample_text <- jsonlite::read_json(
"https://paithiov909.r-universe.dev/gibasa/data/ginga/json",
simplifyVector = TRUE
)
# withr::with_envvar(c(HUGGINGFACE_HUB_CACHE = tempdir()), {
ipadic <- hfhub::hub_download("ryan-minato/vibrato-models", "ipadic-mecab-2_7_0/system.dic")
# })
vibrrt::tokenize(
sample_text[5:8],
tagger = vibrrt::create_tagger(ipadic)
)
#> # A tibble: 187 × 5
#> doc_id sentence_id token_id token feature
#> <fct> <int> <int> <chr> <chr>
#> 1 1 1 1 記号,空白,*,*,*,*, , ,
#> 2 1 1 2 カムパネルラ 名詞,一般,*,*,*,*,*
#> 3 1 1 3 が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
#> 4 1 1 4 手 名詞,一般,*,*,*,*,手,テ,テ
#> 5 1 1 5 を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
#> 6 1 1 6 あげ 動詞,自立,*,*,一段,連用形,あげる,アゲ,アゲ……
#> 7 1 1 7 まし 助動詞,*,*,*,特殊・マス,連用形,ます,マシ,マシ……
#> 8 1 1 8 た 助動詞,*,*,*,特殊・タ,基本形,た,タ,タ……
#> 9 1 1 9 。 記号,句点,*,*,*,*,。,。,。
#> 10 1 1 10 それ 名詞,代名詞,一般,*,*,*,それ,ソレ,ソレ……
#> # ℹ 177 more rows
Versioning
This package is versioned by copying the version number of vibrato, where the first three digits represent that version number and the fourth digit (if any) represents the patch release for this package.