Re: Unicode and comparisons

Question: is this behaviour acceptable, or should I go even further and mask decoding errors during comparisons and containment tests too?
I always thought it is a core property of cmp that it works between all objects. Because of that,
fails. As always with cmp, I'd expect to get a consistent outcome here (i.e. cmp should give a total order on objects). OTOH, I'm not so sure why cmp between plain and unicode strings needs to perform UTF-8 conversion? IOW, why is it desirable that
>>> 'a' == u'a'
1
Anyway, I'm not objecting to that outcome - I only think that, to get cmp consistent, it may be necessary to drop this result. If it is not necessary, so much the better.
Regards, Martin

"Martin v. Loewis" wrote:
It does, but not necessarily without exceptions. I could easily mask the decoding errors too and then have cmp() work exactly as for strings, but the outcome may be different to what the user had expected due to the failing conversion. Sorting order may then look quite unsorted...
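A minimal sketch of the two options being weighed here, written for a current Python 3 interpreter, where the plain string of this thread is modelled as `bytes` and the Unicode string as `str`. The helper names `strict_equal` and `masked_equal` are hypothetical, the UTF-8 default follows the conversion discussed above, and treating an undecodable string as simply "not equal" is only one assumed reading of what masking would do.

```python
def strict_equal(plain, text):
    """Decode the byte string as UTF-8 first; invalid input raises."""
    return plain.decode("utf-8") == text


def masked_equal(plain, text):
    """Same conversion, but a decoding failure is reported as 'not equal'."""
    try:
        return plain.decode("utf-8") == text
    except UnicodeDecodeError:
        # Assumption: masking means swallowing the error and answering False.
        return False


# A Latin-1 encoded a-umlaut is the single byte 0xE4, which is not valid UTF-8.
latin1_bytes = "\u00e4".encode("latin-1")

print(strict_equal(b"a", "a"))                 # True
print(masked_equal(latin1_bytes, "\u00e4"))    # False
```

With masking, the Latin-1 byte string silently compares unequal to the character it was meant to represent, whereas `strict_equal(latin1_bytes, "\u00e4")` raises `UnicodeDecodeError` and makes the encoding mismatch visible.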
This is needed to enhance inter-operability between Unicode and normal strings. Note that they also have the same hash value (provided both use the ASCII code range), making them interchangeable in dictionaries:
This is by design.
-- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

I always thought it is a core property of cmp that it works between all objects.
Not any more. Comparisons can raise exceptions -- this has been so since release 1.5. This is rarely used between standard objects, but not unheard of; and class instances can certainly do anything they want in their __cmp__. --Guido van Rossum (home page: http://www.python.org/~guido/)
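As a rough illustration of a class instance whose comparison raises, here is a sketch for a current Python 3 interpreter. `__cmp__` belongs to the Python of this thread; the sketch uses the rich comparison methods `__eq__` and `__lt__` instead so that it runs today. The class `Strict` and the helper `try_compare` are hypothetical names.

```python
class Strict:
    """Hypothetical value type that refuses comparison with other types."""

    def __init__(self, value):
        self.value = value

    def _check(self, other):
        if not isinstance(other, Strict):
            raise TypeError("cannot compare Strict with " + type(other).__name__)

    def __eq__(self, other):
        self._check(other)
        return self.value == other.value

    def __lt__(self, other):
        self._check(other)
        return self.value < other.value


def try_compare(a, b):
    """Return the result of a == b, or the name of the exception it raised."""
    try:
        return a == b
    except Exception as exc:
        return type(exc).__name__


print(try_compare(Strict(1), Strict(1)))   # True
print(try_compare(Strict(1), "1"))         # TypeError
```

Sorting a list that contains only `Strict` instances still works, since every pairwise comparison is defined; the exception appears only when an instance meets a foreign type.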

Hi! Guido van Rossum:
Python 1.6a1 (#6, Apr 2 2000, 02:32:06) [GCC egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)] on linux2 Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
IMO we will have a *very* hard time explaining *this* behaviour to newbies! Unicode objects are similar to normal string objects from the user's POV. It is unintuitive that objects that are far less similar (like for example numbers and strings) compare the way they do now, while the attempt to compare a unicode string with a standard string object containing the same character raises an exception. Mit freundlichen Grüßen (Regards), Peter (BTW: using a 12-year-old US keyboard and a custom xmodmap all the time to write umlauts and lots of other interesting chars: ÷× ± ²³ ½¼ ° µ «» ¿? ¡! ;-)

Peter Funk wrote:
I don't think newbies will really want to get into the UTF-8 business right from the start... when they do, they probably know about the above problems already. Changing this behaviour to silently swallow the decoding error would do more harm than good, IMHO. Newbies sure would find (u'a' not in 'aäöü') == 1 just as surprising... -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
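The containment case can be sketched the same way for a current Python 3 interpreter, again modelling the plain string as `bytes`. The helper `masked_contains` is a hypothetical name, and the sketch assumes the plain string was typed in Latin-1, which the thread does not state.

```python
def masked_contains(needle, haystack):
    """Containment test that swallows UTF-8 decoding errors (assumed masking)."""
    try:
        return needle in haystack.decode("utf-8")
    except UnicodeDecodeError:
        # The error is hidden and the answer silently becomes "not found".
        return False


# 'a' followed by a-, o- and u-umlaut, encoded as Latin-1 (an assumption).
haystack = "a\u00e4\u00f6\u00fc".encode("latin-1")

print(not masked_contains("a", haystack))   # True
```

The `a` is plainly present in the byte string, yet the masked test reports it as absent because the bytes after it are not valid UTF-8.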

participants (4): Guido van Rossum, M.-A. Lemburg, Martin v. Loewis, pf@artcom-gmbh.de