Rewrites Japanese text according to a set of normalization rules modeled after Sudachi dictionary definitions.
Usage
strj_rewrite_as_def(text, as = read_rewrite_def())
Arguments
- text
A character vector containing Japanese text.
- as
A rewrite definition object, as returned by read_rewrite_def().
Details
This function applies character-level rewrite rules that map variant forms to a normalized form, optionally ignoring specified characters.
The implementation is a simplified, heuristic adaptation of Sudachi-style normalization: it relies on fixed replacement rules and does not aim to fully reproduce Sudachi's normalization behavior.
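To make the idea of fixed, character-level replacement with an ignore list concrete, the following base R sketch may help. It is not the package's implementation: rewrite_chars(), rules, and ignore are hypothetical names introduced only for illustration, and the two-entry rule table is an assumption rather than the content of any bundled definition file.

```r
# Illustrative sketch only; this is not how audubon implements the function.
# `rewrite_chars()`, `rules`, and `ignore` are hypothetical names.
rewrite_chars <- function(text, rules, ignore = character()) {
  vapply(text, function(x) {
    # Split the string into single characters
    chars <- strsplit(x, "")[[1]]
    # Rewrite only characters that have a rule and are not ignored
    hit <- chars %in% names(rules) & !(chars %in% ignore)
    chars[hit] <- unname(rules[chars[hit]])
    paste(chars, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

# Assumed rule table: two full-width Latin letters mapped to ASCII
rules <- setNames(c("S", "p"), c("\uff33", "\uff50"))

# Full-width "S" and "p" are rewritten; the katakana has no rule and is kept
rewrite_chars("\uff33\uff50\u30a2", rules)

# Listing a character in `ignore` leaves it untouched even though a rule exists
rewrite_chars("\uff33\uff50", rules, ignore = "\uff50")
```

A plain lookup table like this only handles one-to-one character replacement; rules that map multi-character sequences would need a different approach.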
Examples
strj_rewrite_as_def(
paste0(
"\u2015\u2015\u5357\u30a2\u30eb",
"\u30d7\u30b9\u306e\u3000\u5929",
"\u7136\u6c34-\u3000\uff33\uff50",
"\uff41\uff52\uff4b\uff49\uff4e\uff47*",
"\u3000\uff2c\uff45\uff4d\uff4f\uff4e+",
"\u3000\u30ec\u30e2\u30f3\u4e00\u7d5e\u308a"
)
)
#> [1] "――南アルプスの 天然水- Sparking* Lemon+ レモン一絞り"
strj_rewrite_as_def(
"\u60e1\u3068\u5047\u9762\u306e\u30eb\u30fc\u30eb",
read_rewrite_def(system.file("def/kyuji.def", package = "audubon"))
)
#> [1] "悪と仮面のルール"
