
[GvR]
- We need a way to indicate the encoding of Python source code. (Probably a "magic comment".)
[JvR]
How will other parts of a program know which encoding was used for non-unicode string literals?
It seems to me that an encoding attribute for 8-bit strings solves this nicely. The attribute should only be set automatically if the encoding of the source file was specified or when the string has been encoded from a unicode string. The attribute should *only* be used when converting to unicode. (Hm, it could even be used when calling unicode() without the encoding argument.) It should *not* be used when comparing (or adding, etc.) 8-bit strings to each other, since they still may contain binary goop, even in a source file with a specified encoding!
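To make the proposal above concrete, here is a rough sketch in present-day Python 3, where bytes stands in for the 8-bit string type. The names TaggedBytes and to_unicode are hypothetical, made up for illustration; nothing like this exists in the interpreter. The point shown is that the attribute is consulted only on conversion to Unicode, while comparison and concatenation see nothing but raw bytes.

```python
class TaggedBytes(bytes):
    """Hypothetical 8-bit string that remembers the encoding it came from."""

    def __new__(cls, data, encoding=None):
        obj = super().__new__(cls, data)
        obj.encoding = encoding  # None means "unknown, possibly binary"
        return obj

    def to_unicode(self, encoding=None):
        # The attribute is consulted only here, when converting to Unicode.
        codec = encoding or self.encoding
        if codec is None:
            raise ValueError("no encoding known for this 8-bit string")
        return self.decode(codec)


latin = TaggedBytes("caf\u00e9".encode("latin-1"), encoding="latin-1")
raw = TaggedBytes(b"caf\xe9")  # same bytes, encoding unknown

# Comparison and concatenation ignore the tag and work on the raw bytes.
same_bytes = (latin == raw)
joined = latin + raw  # the result is plain bytes; the tag is not propagated
text = latin.to_unicode()
```

Note that the result of the concatenation loses the tag, which is one small instance of the propagation problem: every operation that produces a new string would need to decide what encoding, if any, the result carries.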
Marc-Andre took this idea a bit further, but I think it's not practical given the current implementation: there are too many places where the C code would have to be changed in order to propagate the string encoding information, and there are too many sources of strings with unknown encodings to make it very useful. Plus, it would slow down 8-bit string ops.

I have a better idea: rather than carrying around 8-bit strings with an encoding, use Unicode literals in your source code. If the source encoding is known, these will be converted using the appropriate codec.

If you object to having to write u"..." all the time, we could say that "..." is a Unicode literal if it contains any characters with the top bit on (of course the source file encoding would be used just like for u"..."). But I think this should be enabled by a separate pragma -- people who want to write Unicode-unaware code manipulating 8-bit strings in their favorite encoding (e.g. shift-JIS or Latin-1) should not silently get Unicode strings.

(I thought about an option to make *all strings* (not just literals) Unicode, but the current implementation would require too much hacking. This is what JPython does, and maybe it should be what Python 3000 does; I don't see it as a realistic option for the 1.x series.)

--Guido van Rossum (home page: http://www.python.org/~guido/)
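
As an illustration of the literal rules discussed in this message, the following sketch models how a compiler might pick the value of a string literal from its raw source bytes. It is written in present-day Python 3 with bytes standing in for 8-bit strings. The function literal_value and its parameters unicode_prefix and promote_high_bit are hypothetical names for the u"..." prefix and the proposed pragma; this is not how any real Python parser is implemented.

```python
def literal_value(raw, source_encoding, unicode_prefix=False,
                  promote_high_bit=False):
    """Return the value a literal would get, given its raw source bytes.

    unicode_prefix models writing u"..."; promote_high_bit models the
    proposed pragma (both are assumptions made for this sketch).
    """
    has_high_bit = any(b >= 0x80 for b in raw)
    if unicode_prefix or (promote_high_bit and has_high_bit):
        # Decode using the declared encoding of the source file.
        return raw.decode(source_encoding)
    return raw  # stays an 8-bit string, bytes untouched


# Two kanji as they would sit in a Shift-JIS encoded source file.
sjis_bytes = "\u65e5\u672c".encode("shift_jis")

as_unicode = literal_value(sjis_bytes, "shift_jis", unicode_prefix=True)
default = literal_value(sjis_bytes, "shift_jis")
with_pragma = literal_value(sjis_bytes, "shift_jis", promote_high_bit=True)
ascii_only = literal_value(b"abc", "shift_jis", promote_high_bit=True)
```

Without the pragma, the plain literal keeps its Shift-JIS bytes, so Unicode-unaware code is unaffected; with it, only literals containing top-bit characters become Unicode, and pure-ASCII literals stay 8-bit.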