Precomputed kanji distances

Usage

dstrokedit

dyehli

Format

Symmetric sparse matrices containing distances between a key kanji, its ten nearest neighbors and possibly some other close kanji.
For dstrokedit, these are the stroke edit distances according to Yencken and Baldwin (2008).
For dyehli, these are the bag-of-radicals distances according to Yeh and Li (2002).
Both are instances of the S4 class dsCMatrix (symmetric sparse matrices in column-compressed format) with 2133 rows and 2133 columns.

The matrices cover all kanji that belong to both the pre-2010 and the post-2010 jouyou lists. Row and column indices match the row indices of kbase.
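
To make the storage format more concrete, here is a minimal sketch that builds a small dsCMatrix by hand with the Matrix package. The labels and distance values are made up for illustration and are not taken from dstrokedit or dyehli; the sketch only assumes that an entry of zero stands for "no distance stored", as in the lookups in the Examples section.

```r
library(Matrix)

# Toy stand-in for a distance matrix over five hypothetical characters.
labels <- c("K1", "K2", "K3", "K4", "K5")

# With symmetric = TRUE only one triangle is specified; the other is implied.
toy <- sparseMatrix(
  i = c(1, 1, 2),
  j = c(2, 3, 4),
  x = c(0.25, 0.5, 0.4),
  dims = c(5, 5),
  symmetric = TRUE,
  dimnames = list(labels, labels)
)

is(toy, "dsCMatrix")   # same S4 class as the packaged matrices
toy[2, 1]              # mirrored entry of (1, 2)
toy[4, 5]              # pair without a stored distance, reads as 0
length(toy@x)          # number of values physically stored
```

Because only one triangle is stored and absent pairs take no space, a matrix of this kind stays small even when most pairs have no recorded distance.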

Source

Datasets from https://lars.yencken.org/datasets, made available under the Creative Commons Attribution 3.0 Unported licence.

Computed as part of Yencken, Lars (2010) Orthographic support for passing the reading hurdle in Japanese. PhD Thesis, University of Melbourne, Melbourne, Australia.

References

Yeh, Su-Ling and Li, Jing-Ling (2002). Role of structure and component in judgements of visual similarity of Chinese characters. Journal of Experimental Psychology: Human Perception and Performance, 28(4), 933–947.

Yencken, Lars and Baldwin, Timothy (2008). Measuring and predicting orthographic associations: Modelling the similarity of Japanese kanji. In: Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pp. 1041–1048.

Examples

# Find index for kanji 部
bu_index <- match("部", kbase$kanji)

# Look up available stroke edit distances for 部.
non_zero <- which(dstrokedit[bu_index,] != 0)
sed <- dstrokedit[non_zero, bu_index]
names(sed) <- kbase[non_zero,]$kanji
sort(sed)
#>       郡       剖       度       章       席       痘       適       庫 
#> 0.272727 0.272727 0.363636 0.363636 0.363636 0.416667 0.428571 0.454545 
#>       常       竜       郊       歯       蛮       郭 
#> 0.454545 0.454545 0.454545 0.500000 0.500000 0.545455 

# Look up available bag-of-radicals distances for 部.
non_zero <- which(dyehli[bu_index,] != 0)
bord <- dyehli[non_zero, bu_index]
names(bord) <- kbase[non_zero,]$kanji
sort(bord)
#>       陪       倍       剖       暗       位       泣       競       識 
#> 0.000001 0.167000 0.167000 0.270000 0.270000 0.270000 0.278000 0.278000 
#>       培       賠       辞       障       韻       接       粒       敵 
#> 0.278000 0.278000 0.320000 0.320000 0.320000 0.333000 0.333000 0.355000 
#>       摘       滴       嫡       端       傍       滝       隊       郭 
#> 0.355000 0.355000 0.355000 0.383000 0.423000 0.423000 0.452000 0.452000 
#>       険       邸 
#> 0.500000 0.529000
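
Both lookups above follow the same pattern, so it can be convenient to wrap it in a small helper. The function below is a hypothetical sketch and not part of the package; stored_neighbors and its arguments are names chosen for illustration. It is demonstrated on a toy matrix so that it runs without the package data.

```r
library(Matrix)

# Hypothetical helper: return the stored (non-zero) distances for one index,
# sorted in increasing order and optionally truncated to the k closest.
stored_neighbors <- function(dist_mat, index, labels, k = Inf) {
  row <- dist_mat[index, ]       # dense numeric vector for that row
  keep <- which(row != 0)        # zero is treated as "no distance stored"
  d <- row[keep]
  names(d) <- labels[keep]
  d <- sort(d)
  d[seq_len(min(k, length(d)))]
}

# Toy symmetric sparse matrix with made-up labels and distances.
toy_labels <- c("K1", "K2", "K3", "K4")
toy <- sparseMatrix(
  i = c(1, 1, 2),
  j = c(2, 3, 3),
  x = c(0.5, 0.25, 0.75),
  dims = c(4, 4),
  symmetric = TRUE
)

stored_neighbors(toy, 1, toy_labels)         # all stored neighbors of K1
stored_neighbors(toy, 3, toy_labels, k = 1)  # closest stored neighbor of K3
stored_neighbors(toy, 4, toy_labels)         # K4 has no stored distances
```

With the package data loaded, a call such as stored_neighbors(dstrokedit, bu_index, kbase$kanji) would be expected to give the same values as the first listing above, although the order of tied distances may differ.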