Builds a UTF-8 system dictionary from source dictionary files.
- dic_dir
Directory where the source dictionaries are located. Passed to the dictionary compiler as the argument of the '-d' option.
- out_dir
Directory where the binary dictionary will be written. Passed to the dictionary compiler as the argument of the '-o' option.
- encoding
Encoding of the input CSV files. Passed to the dictionary compiler as the argument of the '-f' option.
This function is a wrapper around the dictionary compiler of 'MeCab'.
Running this function creates four files in out_dir:
'char.bin', 'matrix.bin', 'sys.dic', and 'unk.dic'.
To use the compiled dictionary, you also need a dicrc file in out_dir.
A dicrc file is included in the source dictionaries, so you can simply copy it to out_dir.
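The check-and-copy step described above can be sketched in base R. This is a minimal sketch, not part of 'gibasa': the helper names `missing_dic_files()` and `copy_dicrc()` are hypothetical, and it assumes the dicrc file sits at the top level of the source dictionary directory.

```r
# Hypothetical helper (not part of 'gibasa'): returns the names of the
# four compiled files and the 'dicrc' file that are missing from `out_dir`.
missing_dic_files <- function(out_dir) {
  expected <- c("char.bin", "matrix.bin", "sys.dic", "unk.dic", "dicrc")
  expected[!file.exists(file.path(out_dir, expected))]
}

# Hypothetical helper: copies 'dicrc' from the source dictionary directory
# into `out_dir`, leaving an existing copy untouched.
copy_dicrc <- function(dic_dir, out_dir) {
  src <- file.path(dic_dir, "dicrc")
  if (!file.exists(src)) {
    stop("no 'dicrc' found in ", dic_dir)
  }
  file.copy(src, out_dir, overwrite = FALSE)
}
```

After building a dictionary, calling `copy_dicrc(dic_dir, out_dir)` followed by `missing_dic_files(out_dir)` should return an empty character vector when the output directory is complete.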
# \donttest{
if (requireNamespace("withr")) {
  # create a sample dictionary in temporary directory
  build_sys_dic(
    dic_dir = system.file("latin", package = "gibasa"),
    out_dir = tempdir(),
    encoding = "utf8"
  )
  # copy the 'dicrc' file
  file.copy(system.file("latin/dicrc", package = "gibasa"), tempdir())
  # mocking a 'mecabrc' file to temporarily use the dictionary
  # (the null-device paths below are an assumption; any empty file should work)
  withr::with_envvar(
    c(
      "MECABRC" = if (.Platform$OS.type == "windows") "nul" else "/dev/null",
      "RCPP_PARALLEL_BACKEND" = "tinythread"
    ),
    tokenize("katta-wokattauresikatta", sys_dic = tempdir())
  )
}
#> reading /tmp/RtmpEo2O8w/temp_libpath3aff17ad95c7/gibasa/latin/unk.def ... 2
#> reading /tmp/RtmpEo2O8w/temp_libpath3aff17ad95c7/gibasa/latin/dic.csv ... 450
#> reading /tmp/RtmpEo2O8w/temp_libpath3aff17ad95c7/gibasa/latin/matrix.def ... 1x1
#> done!
#> # A tibble: 11 × 5
#>    doc_id sentence_id token_id token feature
#>    <fct>        <int>    <int> <chr> <chr>
#>  1 1                1        1 ka    か
#>  2 1                1        2 tta   った
#>  3 1                1        3 -     ー
#>  4 1                1        4 wo    を
#>  5 1                1        5 ka    か
#>  6 1                1        6 tta   った
#>  7 1                1        7 u     う
#>  8 1                1        8 re    れ
#>  9 1                1        9 si    し
#> 10 1                1       10 ka    か
#> 11 1                1       11 tta   った
# }