[Python-Dev] Unicode charmap decoders slow

Tony Nelson tonynelson at georgeanelson.com
Tue Oct 4 03:11:29 CEST 2005


Is there a faster way to transcode from an 8-bit charmap encoding to utf-8
than going through unicode()?

I'm writing a small card-file program.  As a test, I use a 53 MB MBox file
in mac-roman encoding.  My program reads and parses the file into messages
in about 3 to 5 seconds (Wow! Go Python!), but takes about 14 seconds to
iterate over the cards and convert them to utf-8:

    for i in xrange(len(cards)):
        u = unicode(cards[i], encoding)   # decode the charmap bytes
        cards[i] = u.encode('utf-8')      # re-encode as utf-8

The time is nearly all in the unicode() call.  It's not so much the
absolute time that bothers me as that it takes about four times as long as
the real work, just to do table lookups.
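
Here is roughly how I timed the two steps separately to see where the time
goes (just a sketch, with cards and encoding as in the loop above):

    import time

    t0 = time.time()
    decoded = [unicode(c, encoding) for c in cards]    # charmap decode
    t1 = time.time()
    encoded = [u.encode('utf-8') for u in decoded]     # utf-8 encode
    t2 = time.time()
    print 'decode: %.1fs  encode: %.1fs' % (t1 - t0, t2 - t1)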

Looking at the source (which, if I have it right, is
PyUnicode_DecodeCharmap() in unicodeobject.c), I think it is doing a
dictionary lookup for each character.  I would have thought it would build
and cache a LUT the size of the charmap (and hook the relevant dictionary
machinery to invalidate the cached LUT if the mapping is changed).
(You may consider this a request for enhancement. ;)
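
In pure Python, the sort of table I have in mind could be built once per
encoding, going straight to utf-8 since that's my final target.  A rough
sketch (assuming every byte value is defined in the charmap, as it is for
mac-roman; undefined bytes would need an error path):

    # Build the LUT once: each of the 256 byte values maps directly
    # to its utf-8 encoding, so the per-character dict lookup happens
    # 256 times per encoding instead of once per character.
    lut = [unicode(chr(b), encoding).encode('utf-8') for b in xrange(256)]

    def recode(s):
        # Transcode an 8-bit charmap string to utf-8 via the table.
        return ''.join([lut[ord(c)] for c in s])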

I thought of using U"".translate(), but the unicode version is documented
to be slow, and the closest I can find to shoving my 8-bit data into a
unicode string without translation is decoding it as latin-1, which maps
each byte straight to the same code point.  Is there some similar
approach?  I'm almost (but not quite) ready to try it in Pyrex.
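
Something like this sketch is what I mean, though translate() itself would
presumably still be the slow part (again assuming every byte maps to a
single defined code point, as in mac-roman):

    # latin-1 decoding maps byte i to code point i, so it gets the raw
    # bytes into a unicode string with no real translation.  The table
    # then maps each of those code points to the one the charmap
    # (mac-roman here) actually calls for.
    table = dict((i, ord(unicode(chr(i), encoding))) for i in xrange(256))

    def recode2(s):
        return unicode(s, 'latin-1').translate(table).encode('utf-8')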

I'm new to Python.  Googling turned up nothing relevant on python.org or
in the newsgroups.  I posted this in comp.lang.python yesterday and got a
couple of responses, but I think this may be too sophisticated a question
for that group.

I'm not a member of this list, so please copy me on replies so I don't have
to hunt them down in the archive.
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>

