[Python-Dev] Unicode charmap decoders slow
Walter Dörwald
walter at livinglogic.de
Wed Oct 5 17:08:04 CEST 2005
Martin v. Löwis wrote:
> Tony Nelson wrote:
>
>>> For decoding it should be sufficient to use a unicode string of
>>> length 256. u"\ufffd" could be used for "maps to undefined". Or the
>>> string might be shorter and byte values greater than the length of
>>> the string are treated as "maps to undefined" too.
>>
>> With Unicode using more than 64K codepoints now, it might be more forward
>> looking to use a table of 256 32-bit values, with no need for tricky
>> values.
>
> You might be missing the point. \ufffd is REPLACEMENT CHARACTER,
> which would indicate that the byte with that index is really unused
> in that encoding.
OK, here's a patch that implements this enhancement to
PyUnicode_DecodeCharmap(): http://www.python.org/sf/1313939
With the patch, the mapping argument to PyUnicode_DecodeCharmap() can be
a unicode string, which is then used directly as a decoding table.
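
To make the table lookup concrete, here is a minimal pure-Python sketch of the idea (written for Python 3, not the C code in the patch). The function name decode_with_table is hypothetical, and treating u"\ufffd" and out-of-range byte values as "maps to undefined" follows the proposal quoted above; the patch itself may differ in detail.

```python
# Sentinel for "maps to undefined", taken from the quoted proposal (an assumption
# about the convention, not a statement about what the patch does).
UNDEFINED = "\ufffd"


def decode_with_table(data: bytes, table: str, errors: str = "strict") -> str:
    """Decode bytes by indexing into a string table (hypothetical helper)."""
    out = []
    for pos, byte in enumerate(data):
        # Byte values beyond the end of the table count as undefined too.
        ch = table[byte] if byte < len(table) else UNDEFINED
        if ch == UNDEFINED:
            if errors == "strict":
                raise UnicodeDecodeError(
                    "charmap", data, pos, pos + 1,
                    "character maps to <undefined>")
            if errors == "ignore":
                continue
            if errors != "replace":
                raise ValueError("unsupported error handler: %r" % errors)
            ch = "\ufffd"  # "replace": emit REPLACEMENT CHARACTER
        out.append(ch)
    return "".join(out)


# A short table covering only 0..127; every higher byte is undefined.
ascii_table = "".join(chr(i) for i in range(128))
decoded = decode_with_table(b"abc", ascii_table)
```

Indexing into a string replaces one dictionary lookup per byte, which is presumably where most of the speedup comes from.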
Speed looks like this:
python2.4 -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('utf-8')"
1000 loops, best of 3: 538 usec per loop
python2.4 -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('mac-roman')"
100 loops, best of 3: 3.85 msec per loop
./python-cvs -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('utf-8')"
1000 loops, best of 3: 539 usec per loop
./python-cvs -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('mac-roman')"
1000 loops, best of 3: 623 usec per loop
Creating the decoding_map as a string should probably be done by
gencodec.py directly. This way the first import of the codec would be
faster too.
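
As an illustration of that conversion, the following sketch (Python 3) turns a dictionary-style decoding map into a 256-character table string. The helper name build_decoding_table and the sample map are hypothetical, and using u"\ufffd" for undefined entries is again an assumption taken from the proposal quoted above; this is not what gencodec.py actually emits.

```python
def build_decoding_table(decoding_map, undefined="\ufffd"):
    """Build a 256-character decoding table from a {byte: code point} dict.

    Missing keys and None values are treated as "maps to undefined".
    """
    chars = []
    for byte in range(256):
        code = decoding_map.get(byte)
        chars.append(undefined if code is None else chr(code))
    return "".join(chars)


# Hypothetical sample map: identity for ASCII, one extra entry, one explicit gap.
sample_map = {i: i for i in range(128)}
sample_map[0x80] = 0x00C4   # LATIN CAPITAL LETTER A WITH DIAERESIS
sample_map[0x81] = None     # explicitly undefined

table = build_decoding_table(sample_map)
```

If the generator wrote the resulting string literal into the codec module, nothing would have to be computed at import time.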
Bye,
Walter Dörwald