[Python-Dev] Unicode charmap decoders slow

Wed Oct 5 17:06:06 CEST 2005

On 10/5/05, M.-A. Lemburg <mal at egenix.com> wrote:
> Of course, a C version could use the same approach as
> the unicodedatabase module: that of compressed lookup
> tables...
>
>         http://aggregate.org/TechPub/lcpc2002.pdf
>
> genccodec.py anyone ?
>

I had written a test codec for single-byte character sets once before,
to evaluate algorithms to use in CJKCodecs (it's not a direct
implementation of what you've mentioned, though).  I just ported it
to unicodeobject (as attached).  It showed noticeably better results
than the charmap codecs:

% python ./Lib/timeit.py -s "s='a'*1024*1024; u=unicode(s)" \
    "s.decode('iso8859-1')"
10 loops, best of 3: 96.7 msec per loop
% ./python ./Lib/timeit.py -s "s='a'*1024*1024; u=unicode(s)" \
    "s.decode('iso8859_10_fc')"
10 loops, best of 3: 22.7 msec per loop
% ./python ./Lib/timeit.py -s "s='a'*1024*1024; u=unicode(s)" \
    "s.decode('utf-8')"
100 loops, best of 3: 18.9 msec per loop
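
For anyone who wants to repeat this kind of measurement from a script
rather than the command line, here is a rough sketch using the timeit
module directly.  It is written for Python 3, so it decodes a bytes
object instead of a str, and it assumes only stock codecs: the
'iso8859_10_fc' codec above exists only with the attached patch, so
'iso8859-10' stands in for it here.  The bench() helper is a
hypothetical name, and the numbers it prints will not be comparable to
the ones quoted above.

```python
import timeit

# 1 MiB of ASCII 'a', matching the setup string used in the commands above.
data = b"a" * 1024 * 1024


def bench(encoding, number=10, repeat=3):
    """Return the best per-loop decode time for `data`, in milliseconds."""
    timer = timeit.Timer(lambda: data.decode(encoding))
    # repeat() returns one total time per run; each run does `number` loops.
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1000.0


if __name__ == "__main__":
    # Stock codecs only; the patched 'iso8859_10_fc' is not assumed to exist.
    for enc in ("iso8859-1", "iso8859-10", "utf-8"):
        print("%-12s %8.3f msec per loop" % (enc, bench(enc)))
```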

(Note that it doesn't contain any documentation or good error
handling yet. :-)
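
For readers without the attachment, the general idea of a table-driven
decoder for a single-byte character set, together with the kind of
error handling the note above refers to, might look roughly like the
following.  This is only an illustrative Python sketch and not the code
in fastmapcodec.diff: the names build_decode_table, fastmap_decode and
TOY_MAPPING are made up here, and the mapping is a toy one rather than
a complete ISO 8859-10 table.

```python
# Illustrative sketch only; NOT the implementation in fastmapcodec.diff.
# All names below are hypothetical.

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def build_decode_table(mapping):
    """Expand a {byte: code point} dict into a 256-entry tuple.

    Bytes without a mapping are stored as None.
    """
    table = [None] * 256
    for byte, codepoint in mapping.items():
        table[byte] = chr(codepoint)
    return tuple(table)


def fastmap_decode(data, table, errors="strict"):
    """Decode `data` (bytes) with one direct table index per input byte."""
    out = []
    for pos, byte in enumerate(data):
        ch = table[byte]
        if ch is None:
            if errors == "strict":
                raise UnicodeDecodeError(
                    "fastmap", data, pos, pos + 1, "byte has no mapping")
            if errors == "ignore":
                continue
            if errors == "replace":
                ch = REPLACEMENT
            else:
                raise LookupError("unknown error handler: %r" % (errors,))
        out.append(ch)
    return "".join(out)


# Toy mapping: ASCII maps to itself, plus one high byte for demonstration.
TOY_MAPPING = {b: b for b in range(0x80)}
TOY_MAPPING[0xA1] = 0x0104
TOY_TABLE = build_decode_table(TOY_MAPPING)
```

With this toy table, fastmap_decode(b"a\xa1", TOY_TABLE) gives
"a\u0104", while an unmapped byte such as 0xFF raises
UnicodeDecodeError unless errors is "replace" or "ignore".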

Hye-Shik
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fastmapcodec.diff
Type: application/octet-stream
Size: 18814 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-dev/attachments/20051006/2106c236/fastmapcodec-0001.obj