Pure python implementation of string-like class
Xavier Morel
xavier.morel at masklinn.net
Sat Feb 25 15:00:20 EST 2006
Akihiro KAYAMA wrote:
> Sorry for my terrible English. I am living in Japan, and we have a
> large number of characters called Kanji. UTF-16(U+0000...U+10FFFF) is
> enough for practical use in this country also, but for academic
> purpose, I need a large codespace over 20-bits. I wish I could use
> unicode's private space (U+60000000...U+7FFFFFFF) in Python.
>
> -- kayama
As far as Unicode is concerned, I think the Kanji are part of the Han
script; you should check the CJK Unified Ideographs and CJK Unified
Ideographs Extension A blocks. They may not all be there, but the 27502
characters from these two tables should be enough for most uses.
Oh, by the way, the Unicode code space only goes up to U+10FFFF. UCS-4's
encoding allows code values up to and including 7FFFFFFF, but the upper
Unicode private space is Plane Sixteen (U+100000–U+10FFFF), the other
private spaces being a part of the Basic Multilingual Plane
(U+E000–U+F8FF) and Plane Fifteen (U+F0000–U+FFFFF). Even UTF-32
doesn't go beyond U+10FFFF.
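If you want to check those limits from the interpreter, something like
the sketch below should do. It assumes a current Python 3 (where
sys.maxunicode is 0x10FFFF and chr() replaces the old unichr()); the
helper names is_private_use and beyond_code_space are made up for the
illustration and are not part of any library.

```python
import sys
import unicodedata

# Plane boundaries of the three private-use areas mentioned above.
PRIVATE_USE_RANGES = {
    "BMP": (0xE000, 0xF8FF),
    "Plane Fifteen": (0xF0000, 0xFFFFF),
    "Plane Sixteen": (0x100000, 0x10FFFF),
}


def is_private_use(code_point):
    """True if the code point has general category 'Co' (private use)."""
    return unicodedata.category(chr(code_point)) == "Co"


def beyond_code_space(code_point):
    """True if chr() rejects the value as outside the Unicode code space."""
    try:
        chr(code_point)
    except ValueError:
        return True
    return False


if __name__ == "__main__":
    print("highest code point:", hex(sys.maxunicode))
    for name, (first, _last) in PRIVATE_USE_RANGES.items():
        print(name, hex(first), "private use:", is_private_use(first))
    # The range asked about in the quoted message lies outside the code space.
    print("0x60000000 rejected:", beyond_code_space(0x60000000))
```

The category lookup comes from the Unicode database bundled with the
interpreter, so the answer reflects whichever Unicode version your
Python ships with.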
Since the Dai Kan-Wa jiten "only" lists about 50,000 kanji (even though
it probably isn't perfectly complete), it fits with ease in either Plane
Fifteen or Plane Sixteen (65,536 code points each, of which all but the
last two are private use).
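As a rough sanity check on the capacity, here is a small sketch that
counts the private-use code points in a range by asking the Unicode
database. It again assumes Python 3; count_private_use is a hypothetical
helper, and KANJI_COUNT is just the approximate figure quoted above, not
an exact count.

```python
import unicodedata

# Approximate number of entries attributed to the Dai Kan-Wa jiten above.
KANJI_COUNT = 50_000


def count_private_use(first, last):
    """Count code points in [first, last] with general category 'Co'."""
    return sum(
        1
        for code_point in range(first, last + 1)
        if unicodedata.category(chr(code_point)) == "Co"
    )


if __name__ == "__main__":
    plane_fifteen = count_private_use(0xF0000, 0xFFFFF)
    bmp_area = count_private_use(0xE000, 0xF8FF)
    print("Plane Fifteen private-use code points:", plane_fifteen)
    print("BMP private-use code points:", bmp_area)
    print("dictionary fits in Plane Fifteen:", KANJI_COUNT <= plane_fifteen)
    print("dictionary fits in the BMP area:", KANJI_COUNT <= bmp_area)
```

The BMP private-use area alone is far too small for a dictionary of that
size, which is why the supplementary planes are the ones to look at.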