Overview
ldccr is a collection of utilities for working with various Japanese corpora.
The goal of the ldccr package is to make Japanese language resources easy to use.
This package provides:
- parsers for several Japanese corpora that are free or openly licensed (non-proprietary).
- a downloader for zipped text files published on Aozora Bunko.
Installation
install.packages("ldccr", repos = c("https://paithiov909.r-universe.dev", "https://cloud.r-project.org"))
Supported Corpora
Monolingual
… | Name | License | Link |
---|---|---|---|
:heavy_check_mark: | Live Door News Corpus | CC BY-ND 2.1 JP | # |
:heavy_check_mark: | Japanese Realistic Textual Entailment Corpus | CC BY-NC-SA 4.0 | # |
:heavy_check_mark: | ja.text8 corpus | CC BY-SA | # |
Download a text file from Aozora Bunko
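# Create a local cache directory for downloaded files.
if (!dir.exists("cache")) dir.create("cache")
# Pick one work at random from the snapshot table, download its zipped text
# into the cache directory, and read the text file line by line.
text <- ldccr::AozoraBunkoSnapshot |>
  dplyr::sample_n(1L) |>
  dplyr::pull("テキストファイルURL") |> # column holding the text file URL
  ldccr::read_aozora(directory = "cache") |>
  readr::read_lines()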
dplyr::glimpse(text)
#> chr [1:16] "雪子さんの泥棒よけ" "夢野久作" ...
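Texts on Aozora Bunko are usually distributed with inline markup: ruby readings in `《…》`, an optional `｜` marking where the ruby base starts, and editorial annotations in `［＃…］`. If the lines you read still contain this markup, a cleanup step along the following lines may help. This is a minimal sketch in base R; `strip_aozora_markup` is a hypothetical helper written for illustration and is not part of ldccr, and the sample lines are made up.

```r
# Hypothetical helper (not part of ldccr): remove common Aozora Bunko
# markup from a character vector of lines. Assumes UTF-8 strings.
strip_aozora_markup <- function(lines) {
  lines <- gsub("《[^》]*》", "", lines)          # ruby readings, e.g. 《ゆきこ》
  lines <- gsub("｜", "", lines, fixed = TRUE)   # ruby base start marker
  lines <- gsub("［＃[^］]*］", "", lines)        # editorial annotations
  lines
}

# Made-up sample lines for illustration only.
sample_lines <- c("雪子《ゆきこ》さんの泥棒よけ", "［＃３字下げ］夢野久作")
cleaned <- strip_aozora_markup(sample_lines)
print(cleaned)
```

The same function can be applied to the `text` vector from the example above, for instance `strip_aozora_markup(text)`. Rarer constructs such as gaiji notes may need extra handling.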