Lexical density is the proportion of content words (lexical items) in a document. This function is a simple helper for calculating the lexical density of given datasets.

Usage

lex_density(vec, contents_words, targets = NULL, negate = c(FALSE, FALSE))

Arguments

vec

A character vector.

contents_words

A character vector containing values to be counted as content words.

targets

A character vector of values used to filter the denominator of the lexical density before the value is computed.

negate

A logical vector of length 2. If an element is TRUE, the matching condition for contents_words or targets, respectively, is negated.

Value

A numeric vector.
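
For instance, assuming lex_density() simply counts the elements of vec that match contents_words and divides by the number of elements that match targets (or by the length of vec when targets is NULL), a sketch with hypothetical English tags would look like this:

pos <- c("noun", "verb", "particle", "noun", "adjective")

# 2 of the 5 tags are nouns, so the expected value is 0.4
lex_density(pos, contents_words = "noun")

# restrict the denominator to non-particles via negate: 2 / 4 = 0.5
lex_density(pos, "noun", targets = "particle", negate = c(FALSE, TRUE))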

Examples

if (FALSE) {
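# Tokenize a few documents from the 'ginga' sample text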
df <- tokenize(
  data.frame(
    doc_id = seq_along(ginga[5:8]),
    text = ginga[5:8]
  )
)
df |>
  prettify(col_select = "POS1") |>
  dplyr::group_by(doc_id) |>
  dplyr::summarise(
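    # noun_ratio: nouns ("\u540d\u8a5e") among tokens that are not
    # particles ("\u52a9\u8a5e") or auxiliary verbs ("\u52a9\u52d5\u8a5e")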
    noun_ratio = lex_density(POS1,
      "\u540d\u8a5e",
      c("\u52a9\u8a5e", "\u52a9\u52d5\u8a5e"),
      negate = c(FALSE, TRUE)
    ),
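    # mvr: adjectives ("\u5f62\u5bb9\u8a5e"), adverbs ("\u526f\u8a5e"), and
    # adnominals ("\u9023\u4f53\u8a5e") over verbs ("\u52d5\u8a5e")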
    mvr = lex_density(
      POS1,
      c("\u5f62\u5bb9\u8a5e", "\u526f\u8a5e", "\u9023\u4f53\u8a5e"),
      "\u52d5\u8a5e"
    ),
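    # vnr: verbs ("\u52d5\u8a5e") over nouns ("\u540d\u8a5e")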
    vnr = lex_density(POS1, "\u52d5\u8a5e", "\u540d\u8a5e")
  )
}