[Python-Dev] Unicode charmap decoders slow

Thu Oct 6 10:51:47 CEST 2005

Martin v. Löwis wrote:

> Hye-Shik Chang wrote:
> 
>> If the encoding optimization can easily be done with Walter's approach,
>> the fastmap codec would be too expensive a way to reach that objective,
>> because we would have to maintain not only fastmap but also charmap for
>> backward compatibility.
> 
> IMO, whether a new function is added or whether the existing function
> becomes polymorphic (depending on the type of table being passed) is
> a minor issue. Clearly, the charmap API needs to stay for backwards
> compatibility; in terms of code size or maintenance, I would actually
> prefer separate functions.

OK, I can update the patch accordingly. Any suggestions for the name? 
PyUnicode_DecodeCharmapString?

> One issue apparently is people tweaking the existing dictionaries,
> with additional entries they think belong there. I don't think we
> need to preserve compatibility with that approach in 2.5, but I
> also think that breakage should be obvious: the dictionary should
> either go away completely at run-time, or be stored under a
> different name, so that any attempt to modify the dictionary
> raises an exception instead of silently having no effect.

IMHO it should be stored under a different name, because there are
codecs (cp037, koi8_r, iso8859_11) that reuse existing dictionaries.

Or we could have a function that recreates the dictionary from the string.
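
A minimal sketch of such a function, in present-day Python syntax. The
helper name decoding_dict_from_string and the use of U+FFFE as the
"undefined" marker are assumptions for illustration, not something the
patch defines:

```python
# Assumption: U+FFFE marks byte values that have no mapping.
UNDEFINED = "\ufffe"


def decoding_dict_from_string(decoding_string):
    """Rebuild a byte -> code point dictionary from a decoding string.

    Position i of the string holds the character that byte value i
    decodes to; positions holding UNDEFINED are left out of the result.
    """
    return {
        byte: ord(char)
        for byte, char in enumerate(decoding_string)
        if char != UNDEFINED
    }
```

Codecs that derive their tables from another codec could call this once
at import time and then adjust the resulting dictionary as before.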

> I envision a layout of the codec files like this:
> 
> decoding_dict = ...
> decoding_map, encoding_map = codecs.make_lookup_tables(decoding_dict)

Apart from the names (and the fact that encoding_map is still a 
dictionary), that's what my patch does.

> I think it should be possible to build efficient tables in a single
> pass over the dictionary, so startup time should be fairly small
> (given that the dictionaries are currently built incrementally, anyway,
> due to the way dictionary literals work).
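
A rough sketch of such a single pass, in present-day Python syntax.
make_lookup_tables is the name proposed above, but its body, the
256-entry table size and the U+FFFE "undefined" marker are assumptions
made here for illustration:

```python
# Assumption: U+FFFE marks byte values that have no mapping.
UNDEFINED = "\ufffe"


def make_lookup_tables(decoding_dict):
    """Build (decoding_map, encoding_map) in one pass over decoding_dict.

    decoding_dict maps byte values (0..255) to code points, or to None
    for undefined bytes. decoding_map is a 256-character string indexed
    by byte value; encoding_map is a dict from code point to byte value.
    """
    chars = [UNDEFINED] * 256
    encoding_map = {}
    for byte, codepoint in decoding_dict.items():
        if codepoint is None:
            continue  # undefined byte: keep the marker, no reverse entry
        chars[byte] = chr(codepoint)
        # If two bytes decode to the same code point, the one seen
        # later in the iteration wins for encoding.
        encoding_map[codepoint] = byte
    return "".join(chars), encoding_map


decoding_map, encoding_map = make_lookup_tables(
    {0x41: 0x41, 0x80: 0x20AC, 0x81: None}
)
```

Both tables are filled from the same loop, so the cost is one iteration
over the dictionary plus a single string join.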

Bye,
    Walter Dörwald