Is 0 > None?? (fwd) (fwd)

Alex Martelli aleax at aleax.it
Tue Sep 4 10:01:02 CEST 2001


"Lulu of the Lotus-Eaters" <mertz at gnosis.cx> writes:
    ...
> |Well, yes -- though the problem is not with the Unicode strings
> |(as I've shown, you can perfectly well compare them to anything:-)
> |The issue, it seems to me, is quite different wrt complex
> |numbers -- in that case, there is _no_ setting that lets you
> |default-sort an heterogeneous list that may contain a complex
> |number somewhere.  The Unicode issue might be finessed in
> |several ways.  The complex-number one is starker IMHO.
> 
> I don't understand the difference Alex points to.  For example, the
> below cases seem awfully similar:

*Look* similar -- on the surface; that's quite a long shot from
*being* similar:-).

>     >>> # Try to sort Unicode values in list
>     >>> lst = [ 128, chr(128), unichr(128), 'a' ]
>     >>> lst
>     [128, '\x80', u'\x80', 'a']
>     >>> lst.sort()
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in ?
>     UnicodeError: ASCII decoding error: ordinal not in range(128)

See the error message?  It tells you the 'ascii' encoding (which
you have chosen to use in 'strict' mode) is unable to convert a
plain string to Unicode.  Comparisons have nothing specific to do
with it: any operation requiring plain->Unicode string conversion
will produce exactly the same error if you use plain strings that
violate the rules of the default encoding you use.

For example:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> lst = ['128', '\x80', u'\x80', 'a']
>>> import operator
>>> reduce(operator.add, lst)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII decoding error: ordinal not in range(128)
>>>

See?  *Exactly* the same error -- and we're definitely not
doing comparisons here, are we?  It's *NOT* about comparing
whatever to whatever else: it IS about implicit conversion
of plain->Unicode string, using a default-encoding whose
rules are in fact NOT respected by the strings you supply
to it.  One way, out of a zillion, in which you can discover
your problem (that you are encoding strings in a way that
is different from the default encoding you have chosen to
use) is of course by comparisons, and one way in which you
can cause comparisons is sorting, but it's a long and thin
chain of indirect causation.  The real, close cause of your
problem is: you're using in your strings an encoding that
differs from the encoding you CHOSE to use (here, you chose
by keeping the default encoding of 'ascii', which I already
singled out as a debatable choice) -- and the conversion
uses the 'strict' error mode, so you are forcibly reminded
of the mismatch between the encoding you CHOSE to use and
the one you are ACTUALLY using in one or more plain strings.

OK so far?  Don't misdiagnose as "a comparison problem" (or
even worse, specifically "a sorting problem") an issue that
has precious little to do with comparison (and is thus even
less-directly connected to sorting), as it surfaces in most
any operation and is in fact connected to *string encoding*:
it's a mismatch between the encoding you chose and the ones
you are actually using.
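
For readers on a present-day Python 3 (an assumption for illustration only: the sessions in this post are Python 2.1, where the widening happens implicitly), the same point can be sketched by spelling the decode out by hand. The helper name `widen` is hypothetical. No comparison happens anywhere below, yet the mis-encoded byte still fails under ASCII:

```python
# Python 3 sketch; the 2.1 sessions in this post do the decode implicitly.
def widen(raw, encoding="ascii"):
    """Hypothetical helper: decode a byte string in strict mode."""
    return raw.decode(encoding, "strict")

# A byte string that respects ASCII decodes fine.
ok = widen(b"a")

# A byte outside range(128) is rejected by strict ASCII,
# and no comparison is involved at any point.
try:
    widen(b"\x80")
    failed = False
except UnicodeDecodeError:
    failed = True

# The very same byte is legal Latin-1, so a matching encoding succeeds.
latin = widen(b"\x80", "latin-1")
```

The failure depends only on the pairing of bytes and codec, which is why sorting is just one of many operations that can expose it.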


>     >>> # Try to sort complex numbers in list
>     >>> l2 = [ 1, 1.0, 1+1j ]
>     >>> l2
>     [1, 1.0, (1+1j)]
>     >>> l2.sort()
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in ?
>     TypeError: cannot compare complex numbers using <, <=, >, >=

THIS one, on the other hand, IS strictly, solely, exclusively
and ONLY a comparison problem: it has nothing to do with encoding
or other tangential issues.  Indeed, for example:

>>> l2 = [ 1, 1.0, 1+1j ]
>>> reduce(operator.add, l2)
(3+1j)
>>>

any operation that doesn't perform comparisons works just fine,
while it *WOULDN'T* work for operations between Unicode strings
and plain-strings that are mis-encoded wrt the chosen default
encoding -- as we showed a few lines ago, performing exactly the
same thing (reduce with operator.add) on a heterogeneous list of
plain and Unicode strings (with some of the plain strings
mis-encoded) as we're doing here on a heterogeneous list of
complex and real numbers (mis-encoding not an issue here, of
course -- it's a completely different root issue!).

See the difference now?
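
The contrast can be checked in a present-day Python 3 as well (an assumption for illustration; the sessions above are Python 2.1, and the exact error text differs). A minimal sketch: addition over the mixed list works, while anything that needs an ordering comparison does not.

```python
# Python 3 sketch; reduce now lives in functools.
from functools import reduce
import operator

l2 = [1, 1.0, 1 + 1j]

# Addition never compares, so it works on the mixed list.
total = reduce(operator.add, l2)

# Sorting must compare, and ordering comparisons on complex numbers are refused.
try:
    sorted(l2)
    sort_failed = False
except TypeError:
    sort_failed = True
```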


> What can I do to get 'lst' to sort properly?  

As I must already have said 3 or 4 times at least on this thread
(so PLEASE don't anybody complain about my repetitiveness and
verbosity -- clearly, for some readers I'm not being repetitious,
redundant and verbose *ENOUGH*!-), all you have to do is provide
as the default-encoding the encoding that you DO mean to use in
your plain-strings.  For that purpose, Python supplies the function
    sys.setdefaultencoding
which, perhaps unfortunately, gets deliberately *removed* from
the sys module at the end of execution of site.py -- so you have
to set your default encoding BEFORE site.py is done.  You can
do it by editing site.py (it's, I believe, the only standard
Python library module that is MEANT to be user-edited: it sports
a few "if 0:" blocks with comments about "enabling" the code)
or by providing a sitecustomize.py that sets the encoding you
actually *WANT* to use (rather than setting, or accepting, one 
that you DON'T want).

For example, here's the sitecustomize.py I normally use:
D:\ian\good>type \python21\lib\sitecustomize.py
import sys
sys.setdefaultencoding("latin-1")

and here's a session showing that this provides full support
for plainstrings that ARE encoded in latin-1 (aka ISO-8859-1)
in all sorts of mixed-operations with Unicode strings,
including, as one special case out of a zillion, sorting:

>>> import sys
>>> sys.getdefaultencoding()
'latin-1'
>>> lst = ['128', '\x80', u'\x80', 'a']
>>> import operator
>>> reduce(operator.add, lst)
u'128\x80\x80a'
>>> lst.sort()
>>> lst
['128', 'a', '\x80', u'\x80']
>>>

OK so far?  For sorting OR for any other operation between
plain and Unicode strings, the plain strings need to be
encoded with the default encoding you chose.  The "default
default encoding" is ASCII, but you can change that in
site.py or sitecustomize.py.  Not an ideal system (I think
a "default default" of Latin-1 would cover far more sites,
although maybe our Eastern-European and Asian friends would
not be happy; and a way to provide more suppleness as to
when one can set/change the default-encoding WOULD be nice,
though I guess it might give problems with threads &c),
but it does behave consistently -- and it's only VERY
tangentially related to sorting, or comparisons.  OK?

> If the answer is to use a
> custom comparison function, I can do likewise with complex numbers.  I

You can patch over the specific problem of sorting by using
a comparison function that works differently than the stock
one, but while this finesses the real problem ENTIRELY when 
complex numbers are involved (as the ONLY issue is about
comparison in that case!), it definitely does no more than
scratch the surface when mis-encoded plain strings are
involved -- e.g., it does nothing for "reduce(operator.add,..."
and the zillion other ways in which you can hit against the
real underlying issue (mis-encoded plain strings).
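
A sketch of the complex-number side, written for a present-day Python 3 (an assumption: there, sort takes a key function rather than a comparison function). The name `complex_key` is hypothetical, and ordering by real part and then imaginary part is an arbitrary choice made only for the example:

```python
# Python 3 sketch; complex_key is a hypothetical name, not a library function.
def complex_key(z):
    """Order numbers by real part, then imaginary part (an arbitrary choice)."""
    z = complex(z)
    return (z.real, z.imag)

l2 = [1, 1.0, 1 + 1j, 0.5j]

# With an explicit ordering supplied, the mixed list sorts without complaint.
ordered = sorted(l2, key=complex_key)
```

Here the custom ordering is the whole fix, since comparison was the only thing failing.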

> suppose one could say the problem was with the 'chr(128)' element rather
> than the 'unichr(128)' element... but such a characterization doesn't
> particularly bring a solution closer.

Of course it does: it takes you to within 0.18 microns of a
solution -- it takes some amount of determinedly looking the
other way to miss it once you're there.  The problem IS that
you have a mis-encoded string: a PLAIN string that does NOT
respect the default-encoding that you are using.  The real
solution therefore is to fix the mis-encoding: either change
the default-encoding, so it matches the encoding you're in
fact using in your strings (and this is something you can
only do in site.py or sitecustomize.py), OR, explicitly
change the encoding of your plain-strings before you apply
any operation that might otherwise need to widen them up
to Unicode (using a now-assumedly-unchangeable encoding).

E.g.:

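def cleancoding(fleep, mycodec='latin-1'):
    # Widen a plain string to Unicode using the codec it is really in;
    # anything unicode() refuses to decode (a Unicode string, a number)
    # raises TypeError and is returned untouched.
    try: return unicode(fleep, mycodec)
    except TypeError: return fleep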

and ensure your list is spotlessly-encoded by:

lst[:] = map(cleancoding, lst)

or

lst[:] = [cleancoding(x) for x in lst]

or whatever.  This has precious little to do with sorting,
or with comparisons, but it will help the problems that
sorting is apparently revealing in your plainstrings'
encodings (just as it would help "reduce(operator.add",
etc etc).
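
For comparison, roughly the same cleanup in a present-day Python 3 (an assumption for illustration: plain strings become `bytes`, Unicode strings become `str`, and nothing is widened implicitly). The name `cleancoding3` is hypothetical, and latin-1 is assumed to be the encoding the bytes are really in:

```python
# Python 3 sketch; cleancoding3 is a hypothetical name.
def cleancoding3(item, mycodec="latin-1"):
    """Decode byte strings with mycodec; return anything else unchanged."""
    if isinstance(item, bytes):
        return item.decode(mycodec)
    return item

lst = ["128", b"\x80", "\x80", "a"]

# Normalize every element to text first, then any operation is safe.
lst[:] = [cleancoding3(x) for x in lst]
lst.sort()
```

Once every element is text in a known encoding, sorting, concatenation and the rest all behave alike.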


> P.S. The thing that could be fixed in a later Python is the
> non-universal comparison... we are stuck with 'print >>' for backwards
> compatibility.

We're stuck with print>>, but we're also stuck with the
widening of plain-strings up to Unicode ones using the
current default-encoding -- some hypothetical new approach
to "universal comparison" can't just ignore the default
encoding, without breaking working code, and thus violating
backwards compatibility.  E.g., to make my code work with
any default-encoding, I could just assume the default
encoding is latin-1 until and unless a UnicodeError tells
me otherwise, and thus code:

    try: lst.sort()
    except UnicodeError:
        lst[:] = map(cleancoding, lst)
        lst.sort()

This is valid and perfectly idiomatic Python.  If the
comparison of plain to unicode strings didn't widen the
plainstring, but rather performed some arbitrary kind
of comparison unrelated to the encoding, this would
stop working.

There IS space to enhance the widen-to-Unicode behavior,
I think -- maybe, allow non-strict errors as the default,
rather than forcing strict ones -- although I guess that,
with all the thought and debate devoted to this issue by
Lemburg and others about 12 to 18 months ago, the various
possible approaches must have been explored in depth,
already.  "Universal comparison" isn't really part of
any solution to this widening issue.

Comparison issues, and sorting of heterogeneous lists in
particular, ARE central to the new noncomparable behavior
of complex numbers.  I see no solution to THAT one, save:
    a. either drop the pretense that it makes sense to
        compare heterogeneous entities, or
    b. regress the change that makes it illegal to compare
        complex numbers.
[a] might be conceptually cleaner but would break oodles
of existing code, I think.  [b] might be deemed inelegant
by some, but it WOULD be by far the most practical approach.

And confusing the already-difficult issue by throwing into
the mix an _unrelated_ one (plain->unicode widening of
inappropriately-encoded plainstrings wrt the current
default encoding, which MAY be triggered by sorting as
well as by a zillion other operations) isn't helping, in
my personal opinion:-).


Alex





