[Python-Dev] UCS2/UCS4 default

Jeroen Ruigrok van der Werven asmodai at in-nomine.org
Thu Jul 3 19:01:30 CEST 2008

-On [20080703 18:45], James Y Knight (foom at fuhm.net) wrote:
>I think this is misguided.

I am only trying to at least correct the current situation, which I
personally consider a bit of a mess. (It seems others share that view.)

>I'd like to have 3 levels of access available:
>1) "byte"-level. In a new implementation I'd probably choose to make  
>all my strings stored in UTF-8, but UTF-16 is fine too.
>2) codepoint-level.
>3) grapheme-level.

That sounds interesting as well, and I can very much see the advantages of
such levels and their methods, especially in the i18n/l10n work I do.

>You should be able to iterate over the string at any of the levels,  
>ask for the nearest codepoint/grapheme boundary to the left or right  
>of an index at a different level, etc.


Actually it seems Java already has a lot of similar methods.
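
To make the three levels concrete, here is a rough sketch in Python 3 syntax.
The names graphemes() and boundary_left() are hypothetical, not existing APIs,
and the grapheme rule used here (a base code point plus any following
combining marks) is only an approximation I am assuming for illustration; it
is not the full Unicode segmentation algorithm.

```python
import unicodedata

def graphemes(text):
    """Yield approximate grapheme clusters (hypothetical helper).

    Assumption: a cluster is one code point plus any following combining
    marks. Real grapheme segmentation has more rules than this.
    """
    cluster = ""
    for ch in text:
        if cluster and unicodedata.combining(ch) == 0:
            # A non-combining code point starts a new cluster.
            yield cluster
            cluster = ch
        else:
            cluster += ch
    if cluster:
        yield cluster

def boundary_left(text, i):
    """Nearest approximate grapheme boundary at or left of code point index i."""
    while 0 < i < len(text) and unicodedata.combining(text[i]) != 0:
        i -= 1
    return i

s = "e\u0301a"  # 'e', COMBINING ACUTE ACCENT, 'a'

byte_level = list(s.encode("utf-8"))  # 1) storage units (UTF-8 bytes here)
codepoint_level = list(s)             # 2) code points
grapheme_level = list(graphemes(s))   # 3) approximate graphemes
```

For this string the three views have 4, 3 and 2 items respectively, and
boundary_left(s, 1) moves back to index 0 because index 1 is a combining mark.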

>There are a few more desirable operations, to manipulate strings at  
>the grapheme level (because unlike for UTF-8/UTF-16 codepoints,  
>graphemes don't have the nice property of not containing prefixes  
>which are themselves valid graphemes). So, you want a find (and  
>everything else that implicitly does a find operation, like split,  
>replace, strip, etc) which requires that both endpoints of its match  
>are on a grapheme-boundary. [[Probably the easiest way to implement  
>this would be in the regexp engine.]]
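
A minimal sketch of such a boundary-aware find, written without the regexp
engine for brevity. Both is_boundary() and grapheme_find() are hypothetical
names, and the boundary test again assumes the simplified "base plus combining
marks" rule rather than full Unicode grapheme segmentation.

```python
import unicodedata

def is_boundary(text, i):
    """True if code point index i is an approximate grapheme boundary."""
    if i <= 0 or i >= len(text):
        return True
    # A combining mark attaches to what precedes it, so no boundary there.
    return unicodedata.combining(text[i]) == 0

def grapheme_find(text, needle, start=0):
    """Like str.find, but both ends of the match must lie on boundaries."""
    i = text.find(needle, start)
    while i != -1:
        if is_boundary(text, i) and is_boundary(text, i + len(needle)):
            return i
        # Match split a grapheme; keep looking further right.
        i = text.find(needle, i + 1)
    return -1

text = "e\u0301 e"  # 'e' + COMBINING ACUTE ACCENT, space, plain 'e'
plain = text.find("e")
aware = grapheme_find(text, "e")
```

Here the plain find matches the 'e' inside the accented grapheme at index 0,
while the boundary-aware version skips it and reports the standalone 'e' at
index 3. The same check could wrap split, replace, strip and friends.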

Well, your ideas and seeing Java's stuff actually got me excited to work on
these kinds of ideas, alongside my datetime revamp.

What would the chances of inclusion in Python be if such a PEP plus code
were presented, Guido?

Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
Beware of the fury of the patient man...
