[Python-Dev] Unicode charmap decoders slow

Walter Dörwald walter at livinglogic.de
Thu Oct 6 09:28:05 CEST 2005

Martin v. Löwis wrote:

> Walter Dörwald wrote:
>> OK, here's a patch that implements this enhancement to 
>> PyUnicode_DecodeCharmap(): http://www.python.org/sf/1313939
> Looks nice!
>> Creating the decoding_map as a string should probably be done by 
>> gencodec.py directly. This way the first import of the codec would be 
>> faster too.
> Hmm. How would you represent the string in source code? As a Unicode
> literal? With \u escapes,

Yes, simply by outputting repr(decoding_string).
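
A minimal sketch of what that could look like, written for current Python 3 rather than the Python of this thread. The function name emit_decoding_string and the example table are made up for illustration and are not taken from gencodec.py; ascii() is used here because it escapes all non-ASCII characters, much like repr() of a unicode object did at the time.

```python
# Hypothetical sketch: emit a decoding table as an ASCII-only string literal.
# emit_decoding_string is an illustrative name, not part of gencodec.py.

def emit_decoding_string(table):
    """Return Python source text that defines decoding_string."""
    # ascii() escapes every non-ASCII character, so the generated
    # source does not depend on the editor detecting a coding header.
    return "decoding_string = " + ascii(table) + "\n"

# Example table: identity mapping, except byte 0x80 maps to U+20AC.
table = "".join(chr(i) for i in range(256))
table = table[:0x80] + "\u20ac" + table[0x81:]

source = emit_decoding_string(table)

# Round trip: executing the generated source gives back the same table.
namespace = {}
exec(source, namespace)
```

The generated line is pure ASCII, but it is one long literal with no per-character comments.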

> or in a UTF-8 source file?

This might become unreadable if your editor can't detect the coding header.

> Or as a UTF-8
> string, with an explicit decode call?

This is another possibility, but it is unreadable too. We could add the 
real code points as comments, though.
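
For illustration only, a sketch of that variant in current Python 3 syntax, using the two code points U+009C and U+0009 as sample data. The byte escapes say little on their own, so the code points go into comments:

```python
# Sketch: table stored as UTF-8 bytes and decoded once at import time.
decoding_string = (
    b"\xc2\x9c"  # U+009C (two bytes in UTF-8)
    b"\x09"      # U+0009 (one byte in UTF-8)
).decode("utf-8")
```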

> I like the current dictionary style for being readable, as it also
> adds the Unicode character names into comments.

We could use

decoding_string = (
    u"\u009c" # 0x0004 -> U+009C: CONTROL
    u"\u0009" # 0x0005 -> U+0009: HORIZONTAL TABULATION
    # ... one literal per byte value, in byte order
)

However, the current approach has the advantage that only those byte 
values that differ from the identity mapping have to be specified.
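
Both advantages could be combined by keeping the sparse dictionary in the source and expanding it into the full string once. A rough sketch in current Python 3, where make_decoding_string and overrides are illustrative names; codecs.charmap_decode() with a string as the mapping is what the stdlib charmap codecs use today:

```python
import codecs

# Only the byte values that differ from the identity mapping.
# These two entries are sample data, not a complete codec.
overrides = {
    0x04: 0x009C,  # CONTROL
    0x05: 0x0009,  # HORIZONTAL TABULATION
}

def make_decoding_string(overrides):
    """Expand a sparse byte -> code point dict into a 256-character table."""
    return "".join(chr(overrides.get(byte, byte)) for byte in range(256))

decoding_string = make_decoding_string(overrides)

# Decode three bytes through the table; byte 0x41 falls back to identity.
text, consumed = codecs.charmap_decode(b"\x04\x05A", "strict", decoding_string)
```

The expansion costs one pass over 256 values at import time, which is the cost that generating the string in gencodec.py would avoid.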

    Walter Dörwald
