[Python-3000] How will unicode get used?
Gábor Farkas
gabor at nekomancer.net
Thu Sep 21 12:50:30 CEST 2006
Guido van Rossum wrote:
> On 9/20/06, Michael Chermside <mcherm at mcherm.com> wrote:
>> I wrote:
>>>>> msg = u'The ancient greeks used the letter "\U00010143" for the number 5.'
>>>>> msg[35:-18]
>> u'"\U00010143"'
>>>>> greek_five = msg[36:-19]
>>>>> len(greek_five)
>> 2
>>
>>
>> After posting, I realized that it's worse than that. I suspect that if
>> I tried this on a CPython compiled with wide characters, then
>> len(greek_five) would be 1.
>>
>> What should it be? 2? 1? Implementation-dependent?
>
> This has all been rehashed endlessly. It's implementation (and
> platform- and compilation options-) dependent because there are good
> reasons for both choices.
while i understand the constraints, i think it's not a good decision to
leave this to be implementation-dependent.
strings seem to me to be such basic functionality that their
behaviour should not depend on the platform.
for example, how is an application developer then supposed to write
their applications?
should they write their own slicing/whatever functions to get consistent
behaviour on linux/windows?
i think this is not just a 'theoretical' issue. it's a very practical
issue. the only reason it does not seem important is that
currently not many of the non-16-bit unicode characters are in use.
(and this situation seems quite similar to the one when only
8-bit characters were used :-)
btw. an idea:
==============
maybe this 'problem' should be separated into 2 issues:
1. representation of the unicode string (utf-16 or utf-32)
2. behaviour of the unicode strings in python-3000
of course there are some dependencies between them. (mostly the
performance of #2)
so why don't we make the *behaviour* cross-platform, and the
*performance characteristics* and the *representation* platform-dependent?
(this means that jython/ironpython could use utf-16, but would slice
strings more slowly, because of the surrogate issues)
================
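a rough sketch of what that separation could look like, written against
a current python 3. everything here is an assumption for illustration:
CodePointString is a hypothetical name, not a proposed api. it stores
utf-16 code units internally (as jython/ironpython would), but len() and
slicing count code points, at the cost of a linear scan for surrogate
pairs.

```python
import struct


class CodePointString:
    """hypothetical string type: utf-16 storage, code-point indexing."""

    def __init__(self, text):
        # internal representation: a list of 16-bit utf-16 code units
        data = text.encode("utf-16-le")
        self._units = list(struct.unpack("<%dH" % (len(data) // 2), data))

    def _offsets(self):
        # position in _units where each code point starts, plus an end
        # sentinel. this O(n) scan is the performance cost of hiding
        # the representation.
        units = self._units
        offsets = []
        i = 0
        while i < len(units):
            offsets.append(i)
            # a high surrogate followed by a low surrogate is a single
            # code point stored in two units
            is_pair = (
                0xD800 <= units[i] <= 0xDBFF
                and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF
            )
            i += 2 if is_pair else 1
        offsets.append(len(units))
        return offsets

    def __len__(self):
        # number of code points, not number of code units
        return len(self._offsets()) - 1

    def __getitem__(self, index):
        offsets = self._offsets()
        count = len(offsets) - 1
        if isinstance(index, slice):
            start, stop, step = index.indices(count)
            if step != 1:
                raise ValueError("only step 1 is supported in this sketch")
            stop = max(start, stop)
        else:
            if index < 0:
                index += count
            if not 0 <= index < count:
                raise IndexError("index out of range")
            start, stop = index, index + 1
        # cut on code-point boundaries, so a pair is never split
        units = self._units[offsets[start]:offsets[stop]]
        return struct.pack("<%dH" % len(units), *units).decode("utf-16-le")


msg = CodePointString(
    'The ancient greeks used the letter "\U00010143" for the number 5.'
)
greek_five = msg[36:-19]
print(len(greek_five))             # 1
print(len(msg._units) - len(msg))  # 1: one extra unit for the surrogate pair
```

with this, the slice from the quoted example has length 1 no matter how
the string is stored. a real implementation would want to cache the
offsets (or skip the scan when no surrogates are present) instead of
rescanning on every call.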
> Even if CPython 3.0 supports a dynamic
> choice (which some are proposing) then the *language* will still make
> it implementation dependent because of Jython and IronPython, where
> the only choice is UTF-16 (or UCS-2, depending the attitude towards
> surrogates).
>
i don't see why utf-16 should be the only choice. it's the
obvious/most-convenient choice for jython/ironpython, that's correct.
but (correct me if i'm wrong) ironpython or jython could support utf-32
characters. of course, that would mean that they could not use the
platform's own string type for their string handling.
but in the same way i could say that, because most of the unix world is
utf-8, the best way for those pythons is to handle strings internally as
utf-8, couldn't i?
it simply seems strange to me to make compromises that make life harder
for the cpython users, just to make life easier for the
jython/ironpython developers (i mean the 'creators').
gabor