
Overview

ldccr is a collection of utilities for working with various Japanese corpora.

The goal of the ldccr package is to make Japanese language resources easy to use.

This package provides:

  1. parsers for several Japanese corpora that are free or openly licensed (non-proprietary).
  2. a downloader for zipped text files published on Aozora Bunko.

Installation

install.packages("ldccr", repos = c("https://paithiov909.r-universe.dev", "https://cloud.r-project.org"))

Supported Corpora

Monolingual

|   | Name | License | Link |
|---|------|---------|------|
| ✔ | Live Door News Corpus | CC BY-ND 2.1 JP | # |
| ✔ | Japanese Realistic Textual Entailment Corpus | CC BY-NC-SA 4.0 | # |
| ✔ | ja.text8 corpus | CC BY-SA | # |

Multilingual

Currently not supported.

Download a text file from Aozora Bunko

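# Create a local directory to hold the downloaded files
if (!dir.exists("cache")) dir.create("cache")

# Pick one work at random from the bundled snapshot, take its
# text file URL ("テキストファイルURL"), download it into cache/,
# and read the resulting text file line by line
text <- ldccr::AozoraBunkoSnapshot |>
  dplyr::sample_n(1L) |>
  dplyr::pull("テキストファイルURL") |>
  ldccr::read_aozora(directory = "cache") |>
  readr::read_lines()

dplyr::glimpse(text)
#>  chr [1:16] "雪子さんの泥棒よけ" "夢野久作" ...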
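
Aozora Bunko texts conventionally mark ruby readings with 《…》, the start of the ruby base text with ｜, and editorial notes with ［＃…］. If such markup is still present in the lines you have read, a small clean-up step along the following lines may help before further processing. This is only a sketch: `strip_aozora_markup()` is a hypothetical helper written for this example, not a function provided by ldccr, and it does not cover every annotation form.

```r
# Hypothetical helper (NOT part of ldccr): removes common Aozora Bunko markup
# from a character vector of lines. Patterns are written as Unicode escapes:
#   \uff3b\uff03 ... \uff3d  editorial notes  ［＃...］
#   \u300a ... \u300b        ruby readings    《...》
#   \uff5c                   ruby base marker ｜
strip_aozora_markup <- function(lines) {
  # Drop editorial notes first, since they may themselves quote ruby text
  lines <- gsub("\uff3b\uff03[^\uff3d]*\uff3d", "", lines)
  # Drop ruby readings, keeping the base text they annotate
  lines <- gsub("\u300a[^\u300b]*\u300b", "", lines)
  # Drop the marker that indicates where the ruby base text begins
  gsub("\uff5c", "", lines, fixed = TRUE)
}

# Made-up sample lines for illustration (not taken from a downloaded file)
sample_lines <- c(
  "吾輩《わがはい》は猫である。",
  "｜一人《ひとり》で行く［＃「行く」に傍点］"
)

cleaned <- strip_aozora_markup(sample_lines)
print(cleaned)
```

The same function could be applied to the `text` vector from the example above, for instance `strip_aozora_markup(text)`.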

License

MIT license.