[Python-Dev] Unicode charmap decoders slow

Walter Dörwald walter at livinglogic.de
Thu Oct 6 09:28:05 CEST 2005


Martin v. Löwis wrote:

> Walter Dörwald wrote:
> 
>> OK, here's a patch that implements this enhancement to 
>> PyUnicode_DecodeCharmap(): http://www.python.org/sf/1313939
> 
> Looks nice!
> 
>> Creating the decoding_map as a string should probably be done by 
>> gencodec.py directly. This way the first import of the codec would be 
>> faster too.
> 
> Hmm. How would you represent the string in source code? As a Unicode
> literal? With \u escapes,

Yes, simply by outputting repr(decoding_string).
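
A rough sketch of that idea, written for a current Python 3 interpreter, where ascii() plays the role that repr() has for unicode objects in Python 2. The sample map, the build_decoding_string helper and the use of U+FFFE as a placeholder for undefined bytes are assumptions made for illustration, not part of the patch:

```python
# Hypothetical sample map: byte value -> code point, None for "undefined".
decoding_map = {byte: byte for byte in range(256)}
decoding_map.update({0x04: 0x009C, 0x05: 0x0009, 0xFF: None})

# Assumed placeholder character for bytes that have no mapping.
UNDEFINED = "\ufffe"

def build_decoding_string(mapping):
    """Return a 256-character string; position i is what byte i decodes to."""
    return "".join(
        UNDEFINED if mapping[byte] is None else chr(mapping[byte])
        for byte in range(256)
    )

decoding_string = build_decoding_string(decoding_map)

# The escaped literal is pure ASCII, so it can be written into a generated
# source file regardless of the file's encoding.
literal = ascii(decoding_string)
print("decoding_string = " + literal)
```

The output is a single long line, which is compact but hard to review by eye.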

> or in a UTF-8 source file?

This might become unreadable if your editor can't detect the coding header.

> Or as a UTF-8
> string, with an explicit decode call?

This is another possibility, but it is unreadable too. However, we could add
the real codepoints as comments.

> I like the current dictionary style for being readable, as it also
> adds the Unicode character names into comments.

We could use

decoding_string = (
    u"\u009c" # 0x0004 -> U+009C: CONTROL
    u"\u0009" # 0x0005 -> U+0009: HORIZONTAL TABULATION
    ...
)
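
A generator for this layout could look roughly like the sketch below (Python 3). The function name format_decoding_string is hypothetical, and the character names come from unicodedata rather than from the mapping file that gencodec.py reads, so control characters get a generic label instead of names like HORIZONTAL TABULATION:

```python
import unicodedata

def format_decoding_string(decoding_string):
    """Render a decoding string as one commented literal per byte value."""
    lines = ["decoding_string = ("]
    for byte, char in enumerate(decoding_string):
        # Control characters have no name in unicodedata, so use a fallback.
        name = unicodedata.name(char, "<unnamed>")
        lines.append(
            '    "\\u%04x"  # 0x%02X -> U+%04X: %s'
            % (ord(char), byte, ord(char), name)
        )
    lines.append(")")
    return "\n".join(lines)

# Tiny hypothetical table with four entries, just to show the output shape.
source = format_decoding_string("\x00\x01AB")
print(source)
```

Adjacent string literals are concatenated by the compiler, so the generated module still ends up with a single string object.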

However, the current approach has the advantage that only those byte
values that differ from the identity mapping have to be specified.
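
To make the trade-off concrete, here is a small sketch (Python 3) that builds the same mapping both ways and decodes with each. The two overrides are only sample values, and the sketch assumes an interpreter whose codecs.charmap_decode() accepts a string as the mapping argument, as current CPython releases do:

```python
import codecs

# Dictionary style: start from the identity mapping and list only the
# byte values that differ (two sample overrides here).
overrides = {0x04: 0x009C, 0x05: 0x0009}
decoding_map = codecs.make_identity_dict(range(256))
decoding_map.update(overrides)

# String style: all 256 positions must be spelled out, even the identical ones.
decoding_string = "".join(chr(decoding_map[byte]) for byte in range(256))

data = b"\x04\x05A"
from_dict, _ = codecs.charmap_decode(data, "strict", decoding_map)
from_string, _ = codecs.charmap_decode(data, "strict", decoding_string)

print(len(overrides), "entries written by hand vs", len(decoding_string))
print(from_dict == from_string)
```

Both calls give the same result; the difference lies only in how much has to appear in the source and in how the lookup is done.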

Bye,
    Walter Dörwald

