[Python-3000] How will unicode get used?

Thu Sep 21 12:50:30 CEST 2006

Guido van Rossum wrote:
> On 9/20/06, Michael Chermside <mcherm at mcherm.com> wrote:
>> I wrote:
>> >>> msg = u'The ancient greeks used the letter "\U00010143" for the number 5.'
>> >>> msg[35:-18]
>> u'"\U00010143"'
>> >>> greek_five = msg[36:-19]
>> >>> len(greek_five)
>> 2
>>
>>
>> After posting, I realized that it's worse than that. I suspect that if
>> I tried this on a CPython compiled with wide characters, then
>> len(greek_five) would be 1.
>>
>> What should it be? 2? 1? Implementation-dependent?
> 
> This has all been rehashed endlessly. It's implementation (and
> platform- and compilation options-) dependent because there are good
> reasons for both choices. 

while i understand the constraints, i think it's not a good decision to 
leave this implementation-dependent.

strings seem to me such basic functionality that their behaviour 
should not depend on the platform.

for example, how is an application developer then supposed to write 
their applications?

should he write his own slicing/whatever functions to get consistent 
behaviour on linux/windows?
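
to make the difference concrete, here is a minimal sketch (python 3 
syntax) that counts the code units of the character from the example 
above under both representations. the helper names utf16_units and 
utf32_units are made up for this illustration, and encoding the string 
is only a stand-in for what a narrow or a wide build would store 
internally:

```python
# the character from the quoted example: U+10143, outside the 16-bit range
greek_five = "\U00010143"

def utf16_units(s):
    # number of 16-bit code units a utf-16 ("narrow") representation needs
    return len(s.encode("utf-16-le")) // 2

def utf32_units(s):
    # number of 32-bit code units a utf-32 ("wide") representation needs
    return len(s.encode("utf-32-le")) // 4

print(utf16_units(greek_five))  # 2: a surrogate pair
print(utf32_units(greek_five))  # 1: a single code unit
```

these are the two values that len(greek_five) can report, depending on 
how the interpreter was built.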

i think this is not just a 'theoretical' issue. it's a very practical 
issue. the only reason it does not seem important is that characters 
outside the 16-bit range are rarely used at the moment.

(and this situation seems quite similar to the one when only 
8-bit characters were used :-)

btw. an idea:

==============
maybe this 'problem' should be separated into 2 issues:

1. representation of the unicode string (utf-16 or utf-32)
2. behaviour of the unicode strings in python-3000

of course there are some dependencies between them. (mostly, the 
performance of #2 depends on #1)

so why don't we make the *behaviour* cross-platform, and the 
*performance characteristics* and the *representation* platform-dependent?

(this means that jython/ironpython could use utf-16, but would slice 
strings more slowly because of the surrogate issues)
================
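
a rough sketch of what i mean, in python 3 syntax. Utf16Str is a 
hypothetical class invented for this mail, not a proposal for the real 
implementation. it assumes well-formed utf-16 (no lone surrogates) and 
only handles slices with step 1. the storage is utf-16 code units, but 
len() and indexing count code points, so the behaviour matches a wide 
build and only the cost changes (every operation scans the string):

```python
import struct

class Utf16Str:
    """hypothetical string type: utf-16 storage, code point indexing."""

    def __init__(self, text):
        data = text.encode("utf-16-le")
        # one integer per 16-bit code unit
        self.units = list(struct.unpack("<%dH" % (len(data) // 2), data))

    def _starts(self):
        # offsets of the code units that begin a code point; in well-formed
        # utf-16 a low surrogate (0xDC00..0xDFFF) never begins one
        return [i for i, u in enumerate(self.units)
                if not 0xDC00 <= u <= 0xDFFF]

    def __len__(self):
        # O(n) instead of O(1): the price of the compact representation
        return len(self._starts())

    def __getitem__(self, key):
        starts = self._starts()
        n = len(starts)
        starts.append(len(self.units))  # sentinel: end of last code point
        if isinstance(key, slice):
            lo, hi, step = key.indices(n)
            if step != 1:
                raise ValueError("this sketch only handles step 1")
            hi = max(lo, hi)
            units = self.units[starts[lo]:starts[hi]]
        else:
            if key < 0:
                key += n
            if not 0 <= key < n:
                raise IndexError("string index out of range")
            units = self.units[starts[key]:starts[key + 1]]
        # hand the result back as an ordinary str
        return struct.pack("<%dH" % len(units), *units).decode("utf-16-le")

msg = Utf16Str('The ancient greeks used the letter "\U00010143" '
               'for the number 5.')
print(len(msg.units))  # 57 code units are stored
print(len(msg))        # 56 code points are reported
```

with this, msg[36] is the whole character and never half of a surrogate 
pair, whatever the storage looks like underneath.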

> Even if CPython 3.0 supports a dynamic
> choice (which some are proposing) then the *language* will still make
> it implementation dependent because of Jython and IronPython, where
> the only choice is UTF-16 (or UCS-2, depending the attitude towards
> surrogates).
> 

i don't see why utf-16 should be the only choice. it's the 
obvious/most convenient choice for jython/ironpython, that's correct. 
but (correct me if i'm wrong) ironpython or jython could support utf-32 
characters. of course, that would mean they could not use the 
platform's native string type for their string handling.

but in the same way i could say that, because most of the unix world is 
utf-8, the best way for pythons on those platforms is to handle strings 
internally as utf-8, couldn't i?

it simply seems strange to me to make compromises that make life harder 
for cpython users just to make life easier for the jython/ironpython 
developers (i mean the 'creators').

gabor