[Python-Dev] Unicode charmap decoders slow

Tony Nelson tonynelson at georgeanelson.com
Wed Oct 5 04:52:22 CEST 2005

[Recipient list not trimmed, as my replies must be vetted by a moderator,
which seems to delay them. :]

At 11:48 PM +0200 10/4/05, Walter Dörwald wrote:
>Am 04.10.2005 um 21:50 schrieb Martin v. Löwis:
>> Walter Dörwald wrote:
>>> For charmap decoding we might be able to use an array (e.g. a
>>> tuple  (or an array.array?) of codepoints instead of dictionary.
>> This array would have to be sparse, of course.
>For encoding yes, for decoding no.
>> Using an array.array would be more efficient, I guess - but we would
>> need a C API for arrays (to validate the type code, and to get ob_item).
>For decoding it should be sufficient to use a unicode string of
>length 256. u"\ufffd" could be used for "maps to undefined". Or the
>string might be shorter and byte values greater than the length of
>the string are treated as "maps to undefined" too.

With Unicode now using more than 64K codepoints, it might be more forward-
looking to use a table of 256 32-bit values, with no need for tricky
sentinel values.  There is no need to add any C code to the codecs; just add some
more code to the existing C function (which, if I have it right, is
PyUnicode_DecodeCharmap() in unicodeobject.c).
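The table-driven decode is simple enough to sketch.  This is a hypothetical
Python rendering of the idea, not CPython's actual PyUnicode_DecodeCharmap()
(the names build_decode_table and charmap_decode are made up for
illustration); None marks "maps to undefined":

```python
# Sketch only: flatten a codec's byte -> codepoint mapping dict into a
# dense 256-entry table, then decode by plain indexing.

def build_decode_table(mapping):
    """mapping: byte value -> codepoint.  Returns a 256-entry list,
    with None in the slots the codec leaves undefined."""
    table = [None] * 256
    for byte, cp in mapping.items():
        table[byte] = cp
    return table

def charmap_decode(data, table, errors="strict"):
    """Decode bytes to str via one table lookup per input byte."""
    chars = []
    for i, b in enumerate(data):
        cp = table[b]
        if cp is None:
            if errors == "strict":
                raise UnicodeDecodeError("charmap", bytes(data), i, i + 1,
                                         "character maps to <undefined>")
            cp = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER
        chars.append(chr(cp))
    return "".join(chars)
```

The inner loop does no dictionary hashing at all, which is where the
speedup over the current dict-per-byte lookup would come from.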

>> For encoding, having a C trie might give considerable speedup. _codecs
>> could offer an API to convert the current dictionaries into
>> lookup-efficient structures, and the conversion would be done when
>> importing the codec.
>> For the trie, two levels (higher and lower byte) would probably be
>> sufficient: I believe most encodings only use 2 "rows" (256 code
>> point blocks), very few more than three.
>This might work, although nobody has complained about charmap
>encoding yet. Another option would be to generate a big switch
>statement in C and let the compiler decide about the best data

I'm willing to complain. :)  I might allow saving of my (53 MB) MBox file.
(Not that editing received mail makes as much sense as searching it.)

Encoding can be made fast using a simple hash table with external chaining.
There are at most 256 codepoints to encode, and they will normally be well
distributed in their low 8 bits.  Hash on the low 8 bits (just mask), and
chain into an area with 256 entries.  Modest storage and normally short
chains, therefore fast encoding.
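A sketch of that scheme (hypothetical names, Python instead of the C it
would really be written in): bucket each codepoint by its low 8 bits and
resolve the rare collisions by walking a short chain.

```python
# Sketch only: encode-side hash table keyed on the low 8 bits of the
# codepoint, with external chaining for collisions.

def build_encode_table(mapping):
    """mapping: codepoint -> output byte.  Returns 256 chain buckets."""
    buckets = [[] for _ in range(256)]
    for cp, byte in mapping.items():
        buckets[cp & 0xFF].append((cp, byte))  # hash = low 8 bits
    return buckets

def charmap_encode_char(cp, buckets):
    """Return the output byte for codepoint cp, walking the chain."""
    for key, byte in buckets[cp & 0xFF]:
        if key == cp:
            return byte
    raise LookupError("codepoint %#06x has no mapping" % cp)
```

With at most 256 mapped codepoints spread over 256 buckets, the average
chain length is about one entry, so a lookup is effectively a mask, an
index, and one or two comparisons.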

At 12:08 AM +0200 10/5/05, Martin v. Löwis wrote:

>I would try to avoid generating C code at all costs. Maintaining the
>build processes will just be a nightmare.

I agree; also I don't think the generated codecs need to be changed at all.
All the changes can be made to the existing C functions, by adding caching
per a reply of mine that hasn't made it to the list yet.  Well, OK,
something needs to hook writes to the codec's dictionary, but I /think/
that only needs Python code.  I say:

>...I suggest instead just /caching/ the translation in C arrays stored
>with the codec object.  The cache would be invalidated on any write to the
>codec's mapping dictionary, and rebuilt the next time anything was
>translated.  This would maintain the present semantics, work with current
>codecs, and still provide the desired speed improvement.
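In Python terms, the write hook could be as small as a dict subclass that
drops the cached table on any mutation.  This is only an illustration of
the invalidation idea (CodecMapping and get_table are invented names, and
the real cache would live in C arrays on the codec object):

```python
# Sketch only: a mapping that invalidates its cached decode table on
# every write, and rebuilds it lazily on the next translation.

class CodecMapping(dict):
    """byte value -> codepoint, with a lazily rebuilt flat cache."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._cache = None

    def __setitem__(self, key, value):
        self._cache = None  # any write invalidates the cache
        super().__setitem__(key, value)

    def __delitem__(self, key):
        self._cache = None
        super().__delitem__(key)

    def get_table(self):
        """Return the 256-entry table, rebuilding it if stale."""
        if self._cache is None:
            table = [None] * 256
            for byte, cp in self.items():
                table[byte] = cp
            self._cache = table
        return self._cache
```

Because the cache is rebuilt only after a write, codecs that never touch
their mapping dict (i.e. nearly all of them) pay the build cost once and
then translate at array speed, with unchanged semantics.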

Note that this caching is done by new code added to the existing C
functions (which, if I have it right, are in unicodeobject.c).  No
architectural changes are made; no existing codecs need to be changed;
everything will just work, and usually work faster, with very modest memory
requirements: one 256-entry array of 32-bit Unicode values, plus a hash
table with 256 1-byte slots and 256 chain entries, each holding a 4-byte
Unicode value, a 1-byte output value, a 1-byte chain index, and probably 2
bytes of padding, for a hash table size of 2304 bytes per codec.
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>
