Utilities for Various Japanese Corpora • ldccr

Overview

ldccr is a collection of utilities for various Japanese corpora.

The goal of the ldccr package is to make Japanese language resources easy to use.

This package provides:

parsers for several Japanese corpora that are freely or openly licensed (non-proprietary).
a downloader for zipped text files published on Aozora Bunko.

Installation

install.packages("ldccr", repos = c("https://paithiov909.r-universe.dev", "https://cloud.r-project.org"))

Supported Corpora

Monolingual

…	Name	License	Link
✔	Live Door News Corpus	CC BY-ND 2.1 JP	#
✔	Japanese Realistic Textual Entailment Corpus	CC BY-NC-SA 4.0	#
✔	ja.text8 corpus	CC BY-SA	#

Multilingual

Currently not supported.

Download a text file from Aozora Bunko

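# Create a local directory to hold the downloaded file
if (!dir.exists("cache")) dir.create("cache")

# Pick one random work from the bundled snapshot, download its text file
# into "cache", and read it line by line
text <- ldccr::AozoraBunkoSnapshot |>
  dplyr::sample_n(1L) |>
  dplyr::pull("テキストファイルURL") |>
  ldccr::read_aozora(directory = "cache") |>
  readr::read_lines()

# Inspect the resulting character vector
dplyr::glimpse(text)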
#>  chr [1:16] "雪子さんの泥棒よけ" "夢野久作" ...
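
Aozora Bunko text files mark ruby readings with 《…》 and editorial notes with ［＃…］, so the lines read above usually need some cleanup before analysis. The following is a minimal sketch in base R, not part of ldccr: the helper name `strip_aozora_markup` is hypothetical, and the sample vector is made up for illustration (only the title and author come from the output above).

```r
# Made-up sample standing in for the character vector returned above;
# the last line is an illustrative sentence, not a quotation.
text <- c(
  "雪子さんの泥棒よけ",
  "夢野久作",
  "",
  "雪子《ゆきこ》さんは｜留守番《るすばん》をしました［＃注記の例］。"
)

# Hypothetical helper (not an ldccr function): removes common Aozora Bunko
# markup and drops empty lines.
strip_aozora_markup <- function(lines) {
  lines <- gsub("《[^》]*》", "", lines)        # ruby readings
  lines <- gsub("［＃[^］]*］", "", lines)      # editorial annotations
  lines <- gsub("｜", "", lines, fixed = TRUE)  # ruby base-text markers
  lines[nzchar(lines)]                          # drop empty lines
}

cleaned <- strip_aozora_markup(text)
print(cleaned)
```

The patterns cover only the most common notation; real files also carry a header and footer block that this sketch leaves untouched.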

License

MIT license.