Pure python implementation of string-like class

Xavier Morel xavier.morel at masklinn.net
Sat Feb 25 15:00:20 EST 2006


Akihiro KAYAMA wrote:
> Sorry for my terrible English. I am living in Japan, and we have a
> large number of characters called Kanji. UTF-16(U+0000...U+10FFFF) is
> enough for practical use in this country also, but for academic
> purpose, I need a large codespace over 20-bits. I wish I could use
> unicode's private space (U+60000000...U+7FFFFFFF) in Python.
> 
> -- kayama

I think the Kanji are part of the Han script as far as Unicode is
concerned. You should check the CJK Unified Ideographs and CJK Unified
Ideographs Extension A blocks: they may not all be there, but the 27502
characters from these two tables should be enough for most uses.
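
If you want to check a given character from Python, something like the
following rough sketch should do. It assumes a modern Python 3; the
helper name in_cjk_block is my own invention, not a standard function,
and the block boundaries are the ones Unicode publishes for these two
blocks.

```python
import unicodedata

# Block boundaries (inclusive) as published by Unicode.
CJK_BLOCKS = {
    "CJK Unified Ideographs": (0x4E00, 0x9FFF),
    "CJK Unified Ideographs Extension A": (0x3400, 0x4DBF),
}

def in_cjk_block(char):
    """Return the name of the block containing char, or None.

    Hypothetical helper for illustration only. A position inside a
    block may still be unassigned, so this is a range check, not a
    guarantee that the character exists.
    """
    code_point = ord(char)
    for name, (low, high) in CJK_BLOCKS.items():
        if low <= code_point <= high:
            return name
    return None

kanji = "\u6f22"  # the "kan" of "kanji"
print(in_cjk_block(kanji))
print(unicodedata.name(kanji))
```

unicodedata.name() raises ValueError for a code point that has no
name, which is a second way to find out that a character is missing.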

Oh, by the way, the Unicode code space only goes up to U+10FFFF. UCS-4's
encoding allows code values up to and including 7FFFFFFF, but even
UTF-32 doesn't go beyond 10FFFF. The upper Unicode private space is
Plane Sixteen (U+100000–U+10FFFF), the other private spaces being a part
of the Basic Multilingual Plane (U+E000–U+F8FF) and Plane Fifteen
(U+F0000–U+FFFFF).
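
To make the ranges concrete, here is a minimal sketch, assuming a
modern Python 3 (where chr() covers the whole code space). The name
is_private_use is hypothetical. The upper bounds stop at xFFFD because
the last two code points of every plane are noncharacters rather than
private-use characters.

```python
import unicodedata

# Private-use ranges (inclusive). The last two code points of each
# plane (xFFFE and xFFFF) are noncharacters, hence the FFFD bounds.
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),      # Basic Multilingual Plane
    (0xF0000, 0xFFFFD),    # Plane Fifteen
    (0x100000, 0x10FFFD),  # Plane Sixteen
]

def is_private_use(code_point):
    """Hypothetical helper: True if code_point is a private-use one."""
    return any(low <= code_point <= high
               for low, high in PRIVATE_USE_RANGES)

print(is_private_use(0xF0000))      # inside Plane Fifteen
print(is_private_use(0x60000000))   # outside the Unicode code space

# The general category of private-use code points is "Co".
print(unicodedata.category(chr(0x100000)))

# Anything above U+10FFFF is rejected outright.
try:
    chr(0x60000000)
except (ValueError, OverflowError) as exc:
    print("rejected:", exc)
```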

Since the Dai Kan-Wa jiten "only" lists about 50,000 kanji (even though
it probably isn't perfectly complete), it fits with ease in either Plane
Fifteen or Plane Sixteen (65,536 code points each).
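
The arithmetic can be sketched as follows, again assuming Python 3.
The mapping from a dictionary entry index to a private code point is
purely illustrative (private_code_point and the constants are names I
made up), and 50,000 is only the rough figure quoted above.

```python
PLANE_15_BASE = 0xF0000
# 65,536 code points per plane, minus the two trailing noncharacters.
USABLE_PER_PLANE = 0x10000 - 2

KANJI_ESTIMATE = 50_000  # rough size of the Dai Kan-Wa jiten

def private_code_point(index, base=PLANE_15_BASE):
    """Map a zero-based entry index to a private-use character.

    Hypothetical scheme for illustration: entry 0 goes to the first
    code point of the plane, entry 1 to the next, and so on.
    """
    if not 0 <= index < USABLE_PER_PLANE:
        raise ValueError("index does not fit in one private-use plane")
    return chr(base + index)

print(KANJI_ESTIMATE <= USABLE_PER_PLANE)
print(USABLE_PER_PLANE - KANJI_ESTIMATE, "code points to spare")
print(hex(ord(private_code_point(KANJI_ESTIMATE - 1))))
```

Any such assignment is private by definition, so both ends of an
exchange have to agree on the mapping (and on a font) beforehand.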


