[Python-Dev] Unicode charmap decoders slow

Tue Oct 4 09:37:29 CEST 2005

On 04.10.2005 at 04:25, jepler at unpythonic.net wrote:

> As the OP suggests, decoding with a codec like mac-roman or iso8859-1 is very
> slow compared to encoding or decoding with utf-8.  Here I'm working with 53k of
> data instead of 53 megs.  (Note: this is a laptop, so it's possible that
> thermal or battery management features affected these numbers a bit, but by a
> factor of 3 at most)
>
> $ timeit.py -s "s='a'*53*1024; u=unicode(s)" "u.encode('utf-8')"
> 1000 loops, best of 3: 591 usec per loop
> $ timeit.py -s "s='a'*53*1024; u=unicode(s)" "s.decode('utf-8')"
> 1000 loops, best of 3: 1.25 msec per loop
> $ timeit.py -s "s='a'*53*1024; u=unicode(s)" "s.decode('mac-roman')"
> 100 loops, best of 3: 13.5 msec per loop
> $ timeit.py -s "s='a'*53*1024; u=unicode(s)" "s.decode('iso8859-1')"
> 100 loops, best of 3: 13.6 msec per loop
>
> With utf-8 encoding as the baseline, we have
>     decode('utf-8')      2.1x as long
>     decode('mac-roman') 22.8x as long
>     decode('iso8859-1') 23.0x as long
>
> Perhaps this is an area that is ripe for optimization.

For charmap decoding we might be able to use an array (e.g. a tuple,
or perhaps an array.array) of codepoints instead of a dictionary.
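
To make the idea concrete, here is a minimal pure-Python sketch of the two
lookup styles. It is only an illustration, written for a current Python 3
interpreter: the helper names (build_decoding_dict, dict_to_table,
decode_with_dict, decode_with_table) are hypothetical and not part of the
codecs module, and the real charmap decoder is implemented in C, so this
shows the data structure rather than the actual speedup.

```python
# Hypothetical helpers for illustration only; not the codecs module's API.

def build_decoding_dict(encoding):
    """Dictionary form: maps each byte value to a Unicode code point."""
    mapping = {}
    for byte in range(256):
        try:
            mapping[byte] = ord(bytes([byte]).decode(encoding))
        except UnicodeDecodeError:
            pass  # byte is undefined in this encoding
    return mapping


def dict_to_table(mapping):
    """Array form: a 256-entry tuple indexed directly by byte value.

    Undefined bytes are filled with U+FFFD (an assumption of this sketch).
    """
    return tuple(mapping.get(byte, 0xFFFD) for byte in range(256))


def decode_with_dict(data, mapping):
    # One hash lookup per input byte.
    return "".join(chr(mapping[b]) for b in data)


def decode_with_table(data, table):
    # One index operation per input byte, no hashing.
    return "".join(chr(table[b]) for b in data)


mac_roman_dict = build_decoding_dict("mac-roman")
mac_roman_table = dict_to_table(mac_roman_dict)

sample = b"caf\x8e"  # 0x8E is e-acute in mac-roman
assert decode_with_dict(sample, mac_roman_dict) == sample.decode("mac-roman")
assert decode_with_table(sample, mac_roman_table) == sample.decode("mac-roman")
```

Because the table is indexed by the byte value itself, the decoder no
longer needs to hash anything per character, which is the part a C
implementation could turn into a plain array access.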

Or we could implement this array as a C array (i.e. gencodec.py would  
generate C code).

Bye,
    Walter Dörwald