[Python-Dev] Unicode charmap decoders slow
Tony Nelson
tonynelson at georgeanelson.com
Wed Oct 5 04:52:22 CEST 2005
[Recipient list not trimmed, as my replies must be vetted by a moderator,
which seems to delay them. :]
At 11:48 PM +0200 10/4/05, Walter Dörwald wrote:
>Am 04.10.2005 um 21:50 schrieb Martin v. Löwis:
>
>> Walter Dörwald wrote:
>>
>>> For charmap decoding we might be able to use an array (e.g. a
>>> tuple (or an array.array?) of codepoints) instead of a dictionary.
>>>
>>
>> This array would have to be sparse, of course.
>
>For encoding yes, for decoding no.
>
>> Using an array.array would be more efficient, I guess - but we would
>> need a C API for arrays (to validate the type code, and to get ob_item).
>
>For decoding it should be sufficient to use a unicode string of
>length 256. u"\ufffd" could be used for "maps to undefined". Or the
>string might be shorter and byte values greater than the length of
>the string are treated as "maps to undefined" too.
With Unicode now using more than 64K codepoints, it might be more
forward-looking to use a table of 256 32-bit values, with no need for
tricky sentinel values. There is no need to add any C code to the codecs;
just add some more code to the existing C function (which, if I have it
right, is PyUnicode_DecodeCharmap() in unicodeobject.c).
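Roughly, such a table might look like the following (a standalone sketch
with made-up names, not the actual code in unicodeobject.c):

#include <stdint.h>
#include <stddef.h>

/* Above U+10FFFF, so never a real codepoint; marks "maps to undefined". */
#define CHARMAP_UNDEFINED 0xFFFFFFFFu

typedef struct {
    uint32_t map[256];              /* byte value -> Unicode codepoint */
} charmap_decode_table;

/* Decode 'len' bytes into 32-bit codepoints.  Returns the number of
   codepoints written, or -1 at the first unmapped byte (where the real
   code would invoke the codec's error handler instead). */
static long
charmap_decode(const charmap_decode_table *t,
               const unsigned char *in, size_t len, uint32_t *out)
{
    size_t i;
    for (i = 0; i < len; i++) {
        uint32_t cp = t->map[in[i]];
        if (cp == CHARMAP_UNDEFINED)
            return -1;
        out[i] = cp;
    }
    return (long)len;
}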
...
>> For encoding, having a C trie might give considerable speedup. _codecs
>> could offer an API to convert the current dictionaries into
>> lookup-efficient structures, and the conversion would be done when
>> importing the codec.
>>
>> For the trie, two levels (higher and lower byte) would probably be
>> sufficient: I believe most encodings only use 2 "rows" (256 code
>> point blocks), very few more than three.
>
>This might work, although nobody has complained about charmap
>encoding yet. Another option would be to generate a big switch
>statement in C and let the compiler decide about the best data
>structure.
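For reference, the two-level trie Martin describes might look roughly
like this (just a sketch with invented names, not anything that exists
in _codecs today):

#include <stdint.h>

/* Two-level trie: the first level is indexed by the high byte of the
   codepoint, the second by the low byte.  Most charmaps touch only two
   or three 256-codepoint rows, so unused rows stay NULL. */
typedef struct {
    const int16_t *rows[256];       /* each row: 256 entries, -1 = unmapped */
} charmap_encode_trie;

/* Returns the output byte for cp, or -1 if unmapped (codepoints above
   U+FFFF are not handled in this sketch). */
static int
trie_encode(const charmap_encode_trie *t, uint32_t cp)
{
    const int16_t *row;
    if (cp > 0xFFFF)
        return -1;
    row = t->rows[cp >> 8];
    return row ? row[cp & 0xFF] : -1;
}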
I'm willing to complain. :) I might allow saving of my (53 MB) MBox file.
(Not that editing received mail makes as much sense as searching it.)
Encoding can be made fast using a simple hash table with external chaining.
There are at most 256 codepoints to encode, and they will normally be well
distributed in their lower 8 bits. Hash on the low 8 bits (just mask), and
chain into an area with 256 entries. Modest storage and normally short
chains, therefore fast encoding.
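In code, the sort of table I mean might look like this (a standalone
sketch with made-up names; it uses 16-bit chain indices with -1 as "no
entry" for clarity, though they can be packed into single bytes as I
describe below):

#include <stdint.h>

/* Encode-side hash table: hash a codepoint on its low 8 bits and follow
   a short external chain.  At most 256 codepoints are mapped, so the
   chain entries fit in one flat 256-entry array. */
typedef struct {
    uint32_t codepoint;         /* Unicode value to encode              */
    unsigned char output;       /* byte it maps to                      */
    int16_t next;               /* index of next entry in chain, or -1  */
} encode_entry;

typedef struct {
    int16_t head[256];          /* low-8-bits bucket -> first entry, or -1 */
    encode_entry entry[256];
    int nentries;
} charmap_encode_table;

/* Returns the output byte for cp, or -1 if cp is not in the charmap. */
static int
charmap_encode(const charmap_encode_table *t, uint32_t cp)
{
    int16_t i = t->head[cp & 0xFF];
    while (i >= 0) {
        if (t->entry[i].codepoint == cp)
            return t->entry[i].output;
        i = t->entry[i].next;
    }
    return -1;
}

Building the table is a one-time pass over the mapping; a lookup then
touches one bucket and, normally, only one or two chain entries.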
At 12:08 AM +0200 10/5/05, Martin v. Löwis wrote:
>I would try to avoid generating C code at all costs. Maintaining the
>build processes will just be a nightmare.
I agree; also, I don't think the generated codecs need to be changed at all.
All the changes can be made in the existing C functions, by adding caching
as described in a reply of mine that hasn't made it to the list yet. Well,
OK, something needs to hook writes to the codec's dictionary, but I /think/
that only needs Python code. I say:
>...I suggest instead just /caching/ the translation in C arrays stored
>with the codec object. The cache would be invalidated on any write to the
>codec's mapping dictionary, and rebuilt the next time anything was
>translated. This would maintain the present semantics, work with current
>codecs, and still provide the desired speed improvement.
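The shape of that cache might be something like this (again only a sketch
with invented names; the invalidation call would be triggered by the
Python-level hook on the mapping dictionary):

#include <stdint.h>

/* Per-codec cache.  'valid' is cleared by whatever hooks writes to the
   codec's mapping dictionary; the next decode/encode call rebuilds the
   tables from the dictionary before using them. */
typedef struct {
    int valid;                  /* 0 after any write to the mapping     */
    uint32_t decode[256];       /* byte -> codepoint                    */
    /* ... plus the encode hash table from the previous sketch ...      */
} charmap_cache;

static void
charmap_cache_invalidate(charmap_cache *cache)
{
    cache->valid = 0;
}

static void
charmap_cache_ensure(charmap_cache *cache /*, the codec's mapping dict */)
{
    if (cache->valid)
        return;
    /* Rebuild: walk the codec's mapping dictionary and refill the decode
       array and the encode hash table, then mark the cache valid. */
    cache->valid = 1;
}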
Note that this caching is done by new code added to the existing C
functions (which, if I have it right, are in unicodeobject.c). No
architectural changes are made, and no existing codecs need to be changed;
everything will just work, and usually work faster. The memory cost is
very modest: one 256-entry array of 32-bit Unicode values, plus a hash
table with 256 1-byte slots and 256 chain entries, each entry holding a
4-byte Unicode value, a 1-byte output value, a 1-byte chain index, and
probably 2 bytes of filler, for a hash table size of 2304 bytes per codec.
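Spelled out, the packed entry layout behind those numbers would be roughly
(invented names, assuming the compiler adds no surprise padding):

#include <stdint.h>

/* Packed chain entry: 4 + 1 + 1 + 2 = 8 bytes. */
typedef struct {
    uint32_t codepoint;     /* 4-byte Unicode value                          */
    uint8_t  output;        /* encoded byte                                  */
    uint8_t  next;          /* chain index (one value reserved as "end")     */
    uint8_t  pad[2];        /* filler                                        */
} packed_encode_entry;

/* Per-codec totals:
     decode array:   256 * 4 bytes = 1024 bytes
     hash slots:     256 * 1 byte  =  256 bytes
     chain entries:  256 * 8 bytes = 2048 bytes
     hash table (slots + entries)  = 2304 bytes                              */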
____________________________________________________________________
TonyN.:' <mailto:tonynelson at georgeanelson.com>
' <http://www.georgeanelson.com/>