python 3's adoption

Fri Jan 29 10:35:50 EST 2010

* Steven D'Aprano:
> On Fri, 29 Jan 2010 07:10:01 +0100, Alf P. Steinbach wrote:
> 
>>    >>> L = ["æ", "ø", "å"]   # This is in SORTED ORDER in Norwegian L
> 
> [...]
> 
>>    >>> L.sort( key = locale.strxfrm )
>>    >>> L
>>    ['å', 'æ', 'ø']
>>    >>> locale.strcoll( "å", "æ" )
>>    1
>>    >>> locale.strcoll( "æ", "ø" )
>>    -1
>>
>> Note that strcoll correctly orders the strings as ["æ", "ø", "å"], that
>> is, it would have if it could have been used as cmp function to sort (or
>> better, to a separate routine named e.g. custom_sort).
> 
> This is in Python2.5, so I'm explicitly specifying unicode strings:
> 
>>>> L = [u"æ", u"ø", u"å"]
>>>> assert sorted(L) == [u'å', u'æ', u'ø']
> 
> The default C-locale sorting of L does not equal to L. Now let's change 
> to Norwegian locale:
> 
>>>> import locale
>>>> locale.setlocale(locale.LC_ALL, 'nb_NO')
> 'nb_NO'
>>>> print u''.join(sorted(L, cmp=locale.strcoll))
> æøå
> 
> So far so good, we've confirmed that in Python 2.5 we can sort in a 
> locale-aware form. Now, can we do this with a key function? Thanks to 
> Raymond Hettinger's recipe here:
> 
> http://code.activestate.com/recipes/576653/
> 
> 
>>>> print u''.join(sorted(L, key=CmpToKey(locale.strcoll)))
> æøå
> 
> 
> Success!
> 
> Let's try it in Python 3.1:
> 
>>>> L = ["æ", "ø", "å"]
>>>> assert sorted(L) == ['å', 'æ', 'ø']
>>>>
>>>> import locale
>>>> locale.setlocale(locale.LC_ALL, 'nb_NO')
> 'nb_NO'

Hm. A bit off-topic, but...

   >>> import locale
   >>> locale.setlocale(locale.LC_ALL, 'nb_NO')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "C:\Program Files\cpython\python31\lib\locale.py", line 527, in
       return _setlocale(category, locale)
   locale.Error: unsupported locale setting
   >>> _

This on a machine where the Python default locale is Norwegian.

>>>> ''.join(sorted(L, key=CmpToKey(locale.strcoll)))
> 'æøå'
> 
> 
> The definition of CmpToKey can be found at the URL above. It's not very 
> exciting, but here it is:
> 
> 
> def CmpToKey(mycmp):
>     'Convert a cmp= function into a key= function'
>     class K(object):
>         def __init__(self, obj, *args):
>             self.obj = obj
>         def __lt__(self, other):
>             return mycmp(self.obj, other.obj) == -1
>         def __gt__(self, other):
>             return mycmp(self.obj, other.obj) == 1
>         def __eq__(self, other):
>             return mycmp(self.obj, other.obj) == 0
>         def __le__(self, other):
>             return mycmp(self.obj, other.obj) != 1  
>         def __ge__(self, other):
>             return mycmp(self.obj, other.obj) != -1
>         def __ne__(self, other):
>             return mycmp(self.obj, other.obj) != 0
>     return K

This is pretty smart as a generic solution.

Thanks! :-)

I was thinking more of sticking those comparisions in some custom string class 
or such, which would be rather ugly...

The above does have quite high overhead though!

That is, it's /inefficient/ compared to using a 'cmp' function directly.

> If that's too verbose for you, stick this as a helper function in your 
> application:
> 
> 
> def CmpToKey(mycmp):
>     'Convert a cmp= function into a key= function'
>     class K(object):
>         def __init__(self, obj, *args):
>             self.obj = obj
>         __lt__ = lambda s, o: mycmp(s.obj, o.obj) == -1
>         __gt__ = lambda s, o: mycmp(s.obj, o.obj) == 1
>         __eq__ = lambda s, o: mycmp(s.obj, o.obj) == 0
>         __le__ = lambda s, o: mycmp(s.obj, o.obj) != 1
>         __ge__ = lambda s, o: mycmp(s.obj, o.obj) != -1
>         __ne__ = lambda s, o: mycmp(s.obj, o.obj) != 0
>     return K
> 
> 
> [...]
>> The above may just be a bug in the 3.x stxfrm. But it illustrates that
>> sometimes you have your sort order defined by a comparision function.
>> Transforming that into a key can be practically impossible (it can also
>> be quite inefficient).
> 
> This might be true, but you haven't demonstrated it.

The "can be ... practically impossible" was hogwash. I posted late, sorry. The 
"quite inefficient" holds.

> With one little 
> helper function, you should be able to convert any comparison function 
> into a key function, with relatively little overhead.

I wouldn't call an extra Python method call per comparision "relatively little 
overhead".

Note that the same number of comparisions as with a 'cmp' based sort is performed.

But with the wrapper every comparision is indirected through a Python method 
call (I think, but not sure, that the creation of those comparision objects does 
not introduce significant overhead, but the calls must).

<code>
# -*- coding: utf-8 -*-
from __future__ import print_function
from __future__ import unicode_literals
try:
     range = xrange
except:
     pass

import locale
import sys
import random
import timeit

def CmpToKey( mycmp ):
     "Convert a cmp= function into a key= function"
     class K( object ):
         def __init__( self, obj, *args ):
             self.obj = obj
         def __lt__( self, other ):
             return mycmp( self.obj, other.obj ) == -1
         def __gt__( self, other ):
             return mycmp( self.obj, other.obj ) == 1
         def __eq__( self, other ):
             return mycmp( self.obj, other.obj ) == 0
         def __le__( self, other ):
             return mycmp( self.obj, other.obj ) != 1
         def __ge__( self, other ):
             return mycmp( self.obj, other.obj ) != -1
         def __ne__( self, other ):
             return mycmp( self.obj, other.obj ) != 0
     return K

def elapsed_time_for( f ):
     n_calls = 1
     return timeit.timeit( f, number = n_calls )

def major_pyversion():
     return sys.version_info[0]

def sorting_of( string_list ):
     def sort2x():
         string_list.sort( cmp = locale.strcoll )
     def sort3x():
         string_list.sort( key = CmpToKey( locale.strcoll ) )
     return sort2x if major_pyversion() <= 2 else sort3x

alphabet        = "abcdefghijklmnopqrstuvwxyzæøå"
n_strings       = int( 1e5 )
nbno_locale     = ""        # "nb_NO" raises exception on my machine.

def rnd_string():
     len = random.randint( 1, 80 )
     s = ""
     for i in range( len ):
         s = s + random.choice( alphabet )
     return s

locale.setlocale( locale.LC_ALL, nbno_locale )
random.seed( 666 )
assert( rnd_string() == "æmoxqudxlszåaåcteærmuocøjrdæpbvyabwhn" )

print( "Creating array of pseudo-random strings..." )
L = [rnd_string() for i in range( n_strings )]
print( "Sorting..." )
time = elapsed_time_for( sorting_of( L ) )
print( "Last string is '{0}'".format( L[-1] ) )
print( "{0}".format( time ) )
assert( L[-1][0] == "å" )
</code>

<example>
C:\Documents and Settings\Alf\test> py3 sort.py
Creating array of pseudo-random strings...
Sorting...
Last string is 
'åååywxuybrxgvnstoyvcmdåjdlnxwwfsbnzsinxncmtrxuxgtqtduzvåixqovtjxiflfesuvfa'
8.45987772189

C:\Documents and Settings\Alf\test> py2 sort.py
Creating array of pseudo-random strings...
Sorting...
Last string is 
'åååywxuybrxgvnstoyvcmdåjdlnxwwfsbnzsinxncmtrxuxgtqtduzvåixqovtjxiflfesuvfa'
3.41336790011

C:\Documents and Settings\Alf\test> _
</example>

In the run above Python 3.x was only roughly 2.5 times slower. On my machine it 
ranges from 2.5 times slower to 3 times slower. Most often around 2.5 x slower.

Of course this does not measure how much faster a Python 3.x cmp/"<" based sort 
/could be/, only how slow, sluggish and snail-like it is compared to 2.x. :-)

In other words, that's the speed price, 2.5 to 3 times slower sorting, for 
ditching cmp instead of e.g. replacing it with something better.

Cheers & hth.,

- Alf