[Python-Dev] Unicode charmap decoders slow

Thu Oct 6 11:09:51 CEST 2005

Walter Dörwald wrote:
> Martin v. Löwis wrote:
> 
>> Hye-Shik Chang wrote:
>>
>>> If the encoding optimization can easily be done with Walter's approach,
>>> the fastmap codec would be too expensive a way to reach that objective,
>>> because we would have to maintain not only fastmap but also charmap for
>>> backward compatibility.
>>
>>
>> IMO, whether a new function is added or whether the existing function
>> becomes polymorphic (depending on the type of table being passed) is
>> a minor issue. Clearly, the charmap API needs to stay for backwards
>> compatibility; in terms of code size or maintenance, I would actually
>> prefer separate functions.
> 
> 
> OK, I can update the patch accordingly. Any suggestions for the name?
> PyUnicode_DecodeCharmapString?

No, you can factor this part out into a separate C function
- there's no need to add a completely new entry point just
for this optimization. Later on we can then also add support
for compressed tables to the codec in the same way.

>> One issue apparently is people tweaking the existing dictionaries,
>> with additional entries they think belong there. I don't think we
>> need to preserve compatibility with that approach in 2.5, but I
>> also think that breakage should be obvious: the dictionary should
>> either go away completely at run-time, or be stored under a
>> different name, so that any attempt to modify the dictionary
>> raises an exception instead of silently having no effect.
> 
> 
> IMHO it should be stored under a different name, because there are
> codecs (c037, koi8_r, iso8859_11) that reuse existing dictionaries.

Only koi8_u reuses the dictionary from koi8_r - and it's
easy to recreate the codec from a standard mapping file.

> Or we could have a function that recreates the dictionary from the string.

Actually, I'd prefer that these operations be done by the
codec generator script, so that we don't have additional
startup time. The dictionaries should then no longer be
generated; the script should emit the decoding table as a
string instead. I'd like the comments to stay, though.
This can be done like this (using string concatenation
applied by the compiler):

decoding_charmap = (
    u'x' # 0x0000 -> 0x0078 LATIN SMALL LETTER X
    u'y' # 0x0001 -> 0x0079 LATIN SMALL LETTER Y
    ...
)
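
To make the idea concrete, here is a minimal sketch of how a generator
script might emit such a commented table. It is written for current
Python 3 (so the u prefix is not needed), and the function name
render_decoding_charmap, its parameters, and the use of U+FFFE for
undefined positions are assumptions made for illustration, not the
actual generator script:

```python
import unicodedata


def render_decoding_charmap(mapping, size):
    """Return Python source text for a string-based decoding table.

    mapping: dict of byte value -> Unicode code point (hypothetical input,
             e.g. parsed from a standard mapping file).
    size:    number of table entries to emit.
    """
    lines = ["decoding_charmap = ("]
    for byte in range(size):
        # Assumption: undefined positions are filled with U+FFFE.
        code = mapping.get(byte, 0xFFFE)
        name = unicodedata.name(chr(code), "UNDEFINED")
        # One string literal per line; the compiler concatenates them.
        lines.append("    %r  # 0x%04X -> 0x%04X %s"
                     % (chr(code), byte, code, name))
    lines.append(")")
    return "\n".join(lines)


# Emit the two-entry example table and print the generated source.
source = render_decoding_charmap({0x00: 0x78, 0x01: 0x79}, size=2)
print(source)

# The generated source evaluates to a plain string, so decoding a byte
# is just an index lookup into that string.
namespace = {}
exec(source, namespace)
table = namespace["decoding_charmap"]
decoded = "".join(table[b] for b in b"\x00\x01\x00")
```

Because the comments live in the generated source and the table is a
single constant string, nothing has to be rebuilt at import time.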

Either way, monkey patching the codec won't work anymore.
Doesn't really matter, though, as this was never officially
supported.

We've always told people to write their own codecs
if they need to modify an existing one and then hook it into
the system using either a new codec search function or by
adding an appropriate alias.
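
As a rough sketch of the search-function route, written for current
Python 3: the codec name "xydemo", the two-entry table, and the helper
names are made up for this example; a real codec would define all 256
byte values.

```python
import codecs

# Hypothetical two-entry table; byte 0x00 -> 'x', byte 0x01 -> 'y'.
decoding_table = "xy"
# Inverse mapping for encoding: code point -> byte value.
encoding_map = {ord(ch): byte for byte, ch in enumerate(decoding_table)}


def xydemo_encode(text, errors="strict"):
    # Returns (bytes, number of characters consumed).
    return codecs.charmap_encode(text, errors, encoding_map)


def xydemo_decode(data, errors="strict"):
    # Returns (str, number of bytes consumed).
    return codecs.charmap_decode(data, errors, decoding_table)


def search(name):
    # Codec search function: return a CodecInfo for names we handle,
    # None for everything else so other search functions get a chance.
    if name == "xydemo":
        return codecs.CodecInfo(
            name="xydemo",
            encode=xydemo_encode,
            decode=xydemo_decode,
        )
    return None


codecs.register(search)

decoded = b"\x00\x01\x00".decode("xydemo")
encoded = "yx".encode("xydemo")
```

The stock codecs stay untouched this way; the modified mapping lives
entirely in the user's own module.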

-- 
Marc-Andre Lemburg
eGenix.com
