
Martin v. Löwis wrote:
Walter Dörwald wrote:
OK, here's a patch that implements this enhancement to PyUnicode_DecodeCharmap(): http://www.python.org/sf/1313939
Looks nice!
Creating the decoding_map as a string should probably be done by gencodec.py directly. This way the first import of the codec would be faster too.
Hmm. How would you represent the string in source code? As a Unicode literal? With \u escapes,
Yes, simply by outputting repr(decoding_string).
or in a UTF-8 source file?
This might get unreadable if your editor can't detect the coding header.
Or as a UTF-8 string, with an explicit decode call?
This is another possibility, but is unreadable too. But we might add the real codepoints as comments.
I like the current dictionary style for being readable, as it also adds the Unicode character names into comments.
We could use

    decoding_string = (
        u"\u009c" # 0x0004 -> U+009C: CONTROL
        u"\u0009" # 0x0005 -> U+0009: HORIZONTAL TABULATION
        ...
    )

However, the current approach has the advantage that only those byte values that differ from the identity mapping have to be specified.

Bye,
   Walter Dörwald
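
For illustration only, here is a minimal sketch of how the two ideas above might be combined: keep a small dictionary of the bytes that differ from the identity mapping, expand it into a 256-character decoding string, and emit that string with repr(). It is written for current Python 3 (where str is already Unicode), not for the interpreter discussed in this thread. The names overrides and build_decoding_string are hypothetical and are not taken from gencodec.py or from the patch.

```python
import codecs

# Hypothetical overrides: only the byte values that differ from the
# identity mapping are listed (the two entries quoted in the thread).
overrides = {
    0x04: 0x9C,  # 0x04 -> U+009C: CONTROL
    0x05: 0x09,  # 0x05 -> U+0009: HORIZONTAL TABULATION
}


def build_decoding_string(overrides):
    """Expand a sparse byte -> code point dict into a 256-character string.

    Bytes missing from the dict map to the code point with the same value.
    """
    return "".join(chr(overrides.get(byte, byte)) for byte in range(256))


decoding_string = build_decoding_string(overrides)

# What a generator script could write into the codec's source file.
source_literal = repr(decoding_string)

# codecs.charmap_decode accepts a string as the mapping: the character at
# index i is the decoding of byte i.
decoded, consumed = codecs.charmap_decode(b"\x04\x05A", "strict", decoding_string)
```

A generator could do the expansion once and write source_literal into the module, so the codec would not have to build the string at import time.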