Builds a UTF-8 user dictionary from a csv file.
Arguments
- dic_dir
Directory where the source dictionaries are located. This argument is passed as the value of the '-d' option.
- file
Path where the user dictionary is written. This argument is passed as the value of the '-u' option.
- csv_file
Path to an input csv file.
- encoding
Encoding of the input csv file. This argument is passed as the value of the '-f' option.
Details
This function is a wrapper around the dictionary compiler of 'MeCab'.
Note that this function does not support automatic assignment of the word cost field,
so you cannot leave any word cost empty in your input csv file.
To estimate word costs, use the posDebugRcpp() function.
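Because every row must carry an explicit word cost, it can help to check the csv file before compiling it. The following is a minimal sketch in base R, not part of 'gibasa': the helper name check_word_costs() is hypothetical, and it assumes the word cost is the fourth comma-separated field, as in the rows used in the Examples on this page.

```r
# Hypothetical helper (not provided by 'gibasa'): verifies that the fourth
# field (assumed to be the word cost) of every row is a non-empty integer.
check_word_costs <- function(csv_file, encoding = "UTF-8") {
  rows <- utils::read.csv(
    csv_file,
    header = FALSE,
    colClasses = "character", # keep every field as text, so "" stays ""
    strip.white = TRUE,       # tolerate spaces after the commas
    fileEncoding = encoding
  )
  if (ncol(rows) < 4L) {
    stop("expected at least 4 fields per row")
  }
  costs <- rows[[4L]]
  # a cost is accepted only if it is an optionally signed integer
  bad <- which(is.na(costs) | !grepl("^-?[0-9]+$", costs))
  if (length(bad) > 0L) {
    stop(
      "missing or non-integer word cost on row(s): ",
      paste(bad, collapse = ", ")
    )
  }
  invisible(TRUE)
}

# Usage: the second row below has an empty cost, so the check signals an error.
sample_csv <- tempfile(fileext = ".csv")
writeLines(c("qa, 0, 0, 5, x", "qi, 0, 0, , y"), sample_csv)
result <- tryCatch(
  check_word_costs(sample_csv),
  error = function(e) conditionMessage(e)
)
```

Row numbers in the error message refer to rows as parsed by read.csv(), which skips blank lines by default.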
Examples
# \donttest{
if (requireNamespace("withr")) {
  # create a sample dictionary in temporary directory
  build_sys_dic(
    dic_dir = system.file("latin", package = "gibasa"),
    out_dir = tempdir(),
    encoding = "utf8"
  )
  # copy the 'dicrc' file
  file.copy(
    system.file("latin/dicrc", package = "gibasa"),
    tempdir()
  )
  # write a csv file and compile it into a user dictionary
  csv_file <- tempfile(fileext = ".csv")
  writeLines(
    c(
      "qa, 0, 0, 5, \u304f\u3041",
      "qi, 0, 0, 5, \u304f\u3043",
      "qu, 0, 0, 5, \u304f",
      "qe, 0, 0, 5, \u304f\u3047",
      "qo, 0, 0, 5, \u304f\u3049"
    ),
    csv_file
  )
  build_user_dic(
    dic_dir = tempdir(),
    file = (user_dic <- tempfile(fileext = ".dic")),
    csv_file = csv_file,
    encoding = "utf8"
  )
  # mocking a 'mecabrc' file to temporarily use the dictionary
  withr::with_envvar(
    c(
      "MECABRC" = if (.Platform$OS.type == "windows") {
        "nul"
      } else {
        "/dev/null"
      },
      "RCPP_PARALLEL_BACKEND" = "tinythread"
    ),
    {
      tokenize("quensan", sys_dic = tempdir(), user_dic = user_dic)
    }
  )
}
#> reading /tmp/RtmpfQqVIM/temp_libpath379a7f7fb937/gibasa/latin/unk.def ... 2
#> reading /tmp/RtmpfQqVIM/temp_libpath379a7f7fb937/gibasa/latin/dic.csv ... 450
#> reading /tmp/RtmpfQqVIM/temp_libpath379a7f7fb937/gibasa/latin/matrix.def ... 1x1
#>
#> done!
#> reading /tmp/RtmpfQqVIM/file379a5be86ff9.csv ... 5
#>
#> done!
#> # A tibble: 5 × 5
#>   doc_id sentence_id token_id token feature
#>   <fct>        <int>    <int> <chr> <chr>
#> 1 1                1        1 qu    く
#> 2 1                1        2 e     え
#> 3 1                1        3 n     ん
#> 4 1                1        4 sa    さ
#> 5 1                1        5 n     ん
# }