[Python-Dev] Unicode charmap decoders slow

Tue Oct 4 18:44:16 CEST 2005

At 9:37 AM +0200 10/4/05, Walter Dörwald wrote:
>On 04.10.2005 at 04:25, jepler at unpythonic.net wrote:
>
>>As the OP suggests, decoding with a codec like mac-roman or iso8859-1 is
>>very slow compared to encoding or decoding with utf-8. Here I'm working
>>with 53k of data instead of 53 megs. (Note: this is a laptop, so it's
>>possible that thermal or battery management features affected these
>>numbers a bit, but by a factor of 3 at most)
>>
>> $ timeit.py -s "s='a'*53*1024; u=unicode(s)" "u.encode('utf-8')"
>> 1000 loops, best of 3: 591 usec per loop
>> $ timeit.py -s "s='a'*53*1024; u=unicode(s)" "s.decode('utf-8')"
>> 1000 loops, best of 3: 1.25 msec per loop
>> $ timeit.py -s "s='a'*53*1024; u=unicode(s)" "s.decode('mac-roman')"
>> 100 loops, best of 3: 13.5 msec per loop
>> $ timeit.py -s "s='a'*53*1024; u=unicode(s)" "s.decode('iso8859-1')"
>> 100 loops, best of 3: 13.6 msec per loop
>>
>> With utf-8 encoding as the baseline, we have
>>     decode('utf-8')      2.1x as long
>>     decode('mac-roman') 22.8x as long
>>     decode('iso8859-1') 23.0x as long
>>
>> Perhaps this is an area that is ripe for optimization.
>
>For charmap decoding we might be able to use an array (e.g. a tuple
>or an array.array?) of codepoints instead of a dictionary.
>
>Or we could implement this array as a C array (i.e. gencodec.py would
>generate C code).

Fine -- as long as it still allows changing code points.  I add the missing
"Apple logo" code point to mac-roman in order to permit round-tripping
(0xF0 <=> 0xF8FF, per Apple docs).  (New bug #1313051.)

If an all-C implementation wouldn't permit changing codepoints, I suggest
instead just /caching/ the translation in C arrays stored with the codec
object.  The cache would be invalidated on any write to the codec's mapping
dictionary, and rebuilt the next time anything was translated.  This would
maintain the present semantics, work with current codecs, and still provide
the desired speed improvement.
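
A rough sketch of that invalidate-on-write idea, in Python rather than C and
in Python 3 syntax (this thread predates it).  The class name CachedCharmap,
its table() and decode() methods, and the use of U+FFFD for unmapped bytes
are all hypothetical, chosen only for illustration; none of this is an
existing codec API.

```python
class CachedCharmap(dict):
    """Hypothetical byte -> code point mapping with a lazily rebuilt table."""

    def __init__(self, *args):
        super().__init__(*args)
        self._table = None  # cached 256-character table; None means stale

    def __setitem__(self, byte, codepoint):
        super().__setitem__(byte, codepoint)
        self._table = None  # any write invalidates the cache

    def __delitem__(self, byte):
        super().__delitem__(byte)
        self._table = None

    # Note: update(), pop(), clear() etc. bypass the two methods above and
    # would need the same treatment in a complete version.

    def table(self):
        """Return the cached table, rebuilding it if it is stale."""
        if self._table is None:
            # Assumption: unmapped bytes become U+FFFD.
            self._table = "".join(
                chr(self.get(byte, 0xFFFD)) for byte in range(256)
            )
        return self._table

    def decode(self, data):
        """Decode a bytes object by indexing into the cached table."""
        table = self.table()
        return "".join(table[byte] for byte in data)


charmap = CachedCharmap({byte: byte for byte in range(256)})
charmap[0xF0] = 0xF8FF  # the "Apple logo" code point; invalidates the cache
decoded = charmap.decode(b"a\xf0")  # table is rebuilt here, on first use
```

The point is only the bookkeeping: the table is built at most once between
writes, so ordinary decoding never consults the dictionary.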

But is there really no way to do this fast in pure Python, the way a
one-to-one byte mapping can be done with "".translate()?
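
For what it's worth, one possible pure-Python approach, sketched in Python 3
syntax (where str is Unicode and bytes is the byte string): widen each byte
to the code point of the same value via latin-1, then let str.translate()
index into a 256-character table.  The helper names build_decoding_table and
charmap_decode, the toy mapping, and the U+FFFD fallback are assumptions made
for this example only.  I have not timed it against the existing charmap
codecs.

```python
def build_decoding_table(mapping):
    """Flatten a byte -> code point dict into a 256-character string.

    Assumption: bytes missing from the mapping become U+FFFD.
    """
    return "".join(chr(mapping.get(byte, 0xFFFD)) for byte in range(256))


def charmap_decode(data, table):
    """Decode bytes through a 256-character table using str.translate()."""
    # latin-1 maps byte N to code point N, so every ordinal is below 256
    # and table[ordinal] is always a valid lookup.
    return data.decode("latin-1").translate(table)


# Toy mapping: ASCII maps to itself, plus the 0xF0 <=> U+F8FF entry.
mapping = {byte: byte for byte in range(128)}
mapping[0xF0] = 0xF8FF
TABLE = build_decoding_table(mapping)

text = charmap_decode(b"A\xf0", TABLE)
```

Changing a code point stays cheap here: edit the dictionary and rebuild the
256-character table.
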
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>