[Python-Dev] Unicode charmap decoders slow
Walter Dörwald
walter at livinglogic.de
Thu Oct 6 09:28:05 CEST 2005
Martin v. Löwis wrote:
> Walter Dörwald wrote:
>
>> OK, here's a patch that implements this enhancement to
>> PyUnicode_DecodeCharmap(): http://www.python.org/sf/1313939
>
> Looks nice!
>
>> Creating the decoding_map as a string should probably be done by
>> gencodec.py directly. This way the first import of the codec would be
>> faster too.
>
> Hmm. How would you represent the string in source code? As a Unicode
> literal? With \u escapes,
Yes, simply by outputting repr(decoding_string).
> or in a UTF-8 source file?
This might become unreadable if your editor can't detect the coding header.
> Or as a UTF-8
> string, with an explicit decode call?
This is another possibility, but it is unreadable too. We could add the
real code points as comments, though.
> I like the current dictionary style for being readable, as it also
> adds the Unicode character names into comments.
We could use
decoding_string = (
    u"\u009c" # 0x0004 -> U+009C: CONTROL
    u"\u0009" # 0x0005 -> U+0009: HORIZONTAL TABULATION
    ...
)
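A minimal sketch of how a generator script could emit such a commented
literal. This is not what gencodec.py does; the helper name
format_decoding_string and the "CONTROL" fallback for unnamed characters
are assumptions for illustration, and the code is written for Python 3,
where the u"" prefix is still accepted.

```python
import unicodedata


def format_decoding_string(decoding_map):
    """Render a byte -> code point map as commented source text.

    Hypothetical helper: bytes missing from decoding_map are assumed
    to map to themselves (identity mapping).
    """
    lines = ["decoding_string = ("]
    for byte in range(256):
        codepoint = decoding_map.get(byte, byte)
        # Control characters have no Unicode name, so fall back to "CONTROL".
        name = unicodedata.name(chr(codepoint), "CONTROL")
        lines.append(
            '    u"\\u%04x"  # 0x%04x -> U+%04X: %s'
            % (codepoint, byte, codepoint, name)
        )
    lines.append(")")
    return "\n".join(lines)


# Only the two non-identity entries from the example are given here.
source = format_decoding_string({0x04: 0x9C, 0x05: 0x09})
```

The generated text is valid Python source, so it can be written straight
into the codec module and keeps the per-character comments.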
However, the current approach has the advantage that only those byte
values that differ from the identity mapping have to be specified.
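To illustrate the trade-off, here is a small sketch in Python 3 (an
assumption; the thread itself predates it). It builds the same mapping
both ways, using the two byte values from the example, and decodes with
codecs.charmap_decode(), which accepts either a dictionary or a string
as the mapping.

```python
import codecs

# Dictionary style: start from the identity mapping and list only
# the byte values that differ.
decoding_map = codecs.make_identity_dict(range(256))
decoding_map.update({0x04: 0x9C, 0x05: 0x09})

# String style: all 256 positions have to be present, in order.
decoding_string = "".join(chr(decoding_map[byte]) for byte in range(256))

data = b"\x04\x05A"
text_from_dict, _ = codecs.charmap_decode(data, "strict", decoding_map)
text_from_string, _ = codecs.charmap_decode(data, "strict", decoding_string)
```

Both calls should yield the same text; the difference is only in how
much of the table has to be written out in the source.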
Bye,
Walter Dörwald