Compute the stroke edit distances between two sets of kanji
Source: R/stroke_edit_distance.R
sedist.Rd
Variants of the stroke edit distance proposed by Yencken (2010). Each kanji is encoded as a sequence of stroke types according to its stroke order, using the type attribute from the kanjiVG data. Then the edit distance (a.k.a. Levenshtein distance) between the two sequences is computed and divided by the larger of the two kanji's stroke counts.
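The computation described above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: `token_edit_distance()` and `normalized_stroke_distance()` are hypothetical helper names, and the stroke sequences are made up rather than taken from the kanjiVG data.

```r
# Hypothetical helper, not part of the package: Levenshtein distance
# between two character vectors of stroke-type tokens.
token_edit_distance <- function(a, b) {
  n <- length(a)
  m <- length(b)
  # d[i + 1, j + 1] holds the distance between the first i tokens of a
  # and the first j tokens of b.
  d <- matrix(0, nrow = n + 1, ncol = m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (a[i] == b[j]) 0 else 1
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1,  # deletion
                             d[i + 1, j] + 1,  # insertion
                             d[i, j] + cost)   # match or substitution
    }
  }
  d[n + 1, m + 1]
}

# Hypothetical helper: divide by the larger number of strokes.
normalized_stroke_distance <- function(a, b) {
  token_edit_distance(a, b) / max(length(a), length(b))
}

# Made-up stroke-type sequences, for illustration only
s1 <- c("㇑", "㇕", "㇐", "㇐")
s2 <- c("㇑", "㇕", "㇐", "㇒", "㇏")
normalized_stroke_distance(s1, s2)
#> [1] 0.4
```

Here one substitution and one insertion turn the first sequence into the second, giving 2 edit operations divided by 5 strokes. The sketch does not handle two empty sequences, for which the ratio is undefined.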
Usage
sedist(k1, k2, type = c("full", "before_slash", "first"))
Arguments
- k1, k2
atomic vectors or lists of kanji in any format that can be treated by
convert_kanji()
- type
the type of stroke edit distance to compute. See details.
Details
The kanjiVG type attribute is a single string composed of a CJK strokes Unicode character, an optional
Latin letter providing further information and possibly a variant (another CJK strokes character with optional
letter) separated by "/". If type
is "full", a match is only counted if two strings are exactly the
same; "before_slash" ignores any slash and what comes after it; "first" only considers the first
character of each string (that is, the first CJK strokes character) when counting matches.
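As a rough illustration of the three variants, the sketch below reduces type strings before they would be compared. It is an assumption about how such a reduction could look, not the package's code: `reduce_stroke_type()` is a hypothetical helper and the example strings are made up to follow the format described above.

```r
# Hypothetical helper, not part of the package: reduce kanjiVG-style
# type strings according to the chosen variant.
reduce_stroke_type <- function(x, type = c("full", "before_slash", "first")) {
  type <- match.arg(type)
  switch(type,
    full = x,                            # compare the whole string
    before_slash = sub("/.*$", "", x),   # drop the slash and the variant after it
    first = substr(x, 1, 1)              # keep only the first character
  )
}

# Made-up values in the described format, for illustration only
types <- c("㇔/㇏", "㇐a", "㇕b/㇆")
reduce_stroke_type(types, "full")
reduce_stroke_type(types, "before_slash")
reduce_stroke_type(types, "first")
```

Two strokes count as a match when their reduced strings are identical, so "first" is the most lenient variant and "full" the strictest.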
The stroke edit distance used by Yencken (2010) is obtained by setting type = "full" (the default), except that the underlying kanjiVG data has changed significantly since then. Compared with the values in dstrokedit, 96.3 percent of the distances agree, whereas the remaining distances differ by a small amount (usually 1-2 edit operations).
References
Yencken, Lars (2010). Orthographic support for passing the reading hurdle in Japanese.
PhD Thesis, University of Melbourne, Australia.
Examples
ind1 <- 384
k1 <- convert_kanji(ind1, "character")
ind2 <- which(dstrokedit[ind1,] > 0)
# dstrokedit contains only the "closest" kanji
k2 <- convert_kanji(ind2, "character")
row_a <- dstrokedit[ind1, ind2]
if (requireNamespace("kanjistat.data", quietly = TRUE)) {
row_b <- sedist(k1, k2)
mat <- rbind(row_a, row_b)
rownames(mat) <- c(k1, k1)
colnames(mat) <- k2
mat
}
#> 場 湯 易 傷 腸 陣 掲 揚 隅
#> 陽 0.25 0.25 0.3333330 0.3076920 0.3076920 0.4166670 0.4166670 0.25 0.4166670
#> 陽 0.25 0.25 0.3333333 0.3076923 0.3076923 0.4166667 0.4166667 0.25 0.4166667
#> 陳 渇
#> 陽 0.4166670 0.4166670
#> 陽 0.4166667 0.4166667