audubon is a collection of Japanese text processing tools for:
- filling Japanese iteration marks
- hiraganization, katakanization and romanization using hakatashi/japanese.js
- segmentation by phrase using google/budoux and ‘TinySegmenter.js’
- text normalization which is based on rules for the ‘Sudachi’ morphological analyzer and the ‘NEologd’ (Neologism dictionary for ‘MeCab’).
Some of the features above are not implemented in ‘ICU’ (i.e., the stringi package); the goal of the audubon package is to provide these additional features.
strj_fill_iter_mark repeats the previous character and replaces the iteration marks if the element has more than 5 characters. You can use this feature with
Character class conversion uses hakatashi/japanese.js.
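As a minimal sketch of the conversions listed above, the following assumes the exported functions `strj_hiraganize`, `strj_katakanize`, and `strj_romanize` and calls each with its default arguments. The sample sentence is reused from the tokenization example; no output is shown because the exact romanization depends on the configured system.

```r
library(audubon)

text <- "あのイーハトーヴォのすきとおった風"

# Convert katakana in the text to hiragana.
hira <- strj_hiraganize(text)

# Convert hiragana in the text to katakana.
kata <- strj_katakanize(text)

# Romanize the kana using the default romanization system.
roma <- strj_romanize(text)

print(c(hira, kata, roma))
```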
strj_tokenize splits Japanese text into phrases using google/budoux, TinySegmenter, or other tokenizers.
strj_tokenize("あのイーハトーヴォのすきとおった風", engine = "budoux")
#> $`1`
#> [1] "あのイーハトーヴォの" "すきと" "おった"
#> [4] "風"
strj_normalize normalizes text following rules based on the NEologd style.
strj_normalize("――南アルプスの 天然水- Ｓｐａｒｋｉｎｇ* Ｌｅｍｏｎ+ レモン一絞り")
#> [1] "ー南アルプスの天然水-Sparking* Lemon+レモン一絞り"
strj_rewrite_as_def is an R port of SudachiCharNormalizer that typically normalizes characters following a ’*.def’ file.
The audubon package contains several ’*.def’ files, so you can use them or write your own ‘rewrite.def’ file as follows.
# single characters will **never** be normalized.
…
# if two characters are separated with a tab,
# left side forms are always rewritten to right side forms
# before normalized.
斎	斉
齋	斉
齊	斉
# supports rewriting a single character to a single character,
# i.e., this cannot work.
ｱｯ	ア
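A definition file like the one above could be used roughly as follows. This is a sketch under assumptions: it assumes `read_rewrite_def()` accepts a path to a ’*.def’ file and that its result can be passed as the second argument of `strj_rewrite_as_def()`. The temporary file and the sample string are made up for illustration.

```r
library(audubon)

# Write a small, hypothetical rewrite definition to a temporary file.
def_path <- tempfile(fileext = ".def")
con <- file(def_path, open = "w", encoding = "UTF-8")
writeLines(
  c(
    "# a single character on its own line is never normalized",
    "…",
    "# tab-separated pairs: the left form is rewritten to the right form",
    "斎\t斉",
    "齋\t斉",
    "齊\t斉"
  ),
  con
)
close(con)

# Read the definition and apply it (assumed usage, see lead-in).
custom_def <- read_rewrite_def(def_path)
rewritten <- strj_rewrite_as_def("齋藤さんと斎藤さん", custom_def)
print(rewritten)
```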
This feature is more powerful than stringi::stri_trans_* because it allows users to control which characters are normalized. For instance, this function can be used to convert kyuji-tai characters to shinji-tai characters.
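The kyuji-tai to shinji-tai conversion might look like the sketch below. It assumes that one of the bundled ’*.def’ files is installed at `def/kyuji.def` inside the package and that `read_rewrite_def()` can load it; both the path and the sample string are assumptions, so the code stops early if the file is not found.

```r
library(audubon)

# Locate the bundled definition file (the path is an assumption).
kyuji_path <- system.file("def/kyuji.def", package = "audubon")
stopifnot(nzchar(kyuji_path))

# Rewrite old-form (kyuji-tai) kanji to new-form (shinji-tai) kanji.
converted <- strj_rewrite_as_def(
  "惡と假面のルール",
  read_rewrite_def(kyuji_path)
)
print(converted)
```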