"Martin v. Loewis" wrote:
Guido van Rossum email@example.com writes:
This makes Latin-1 the right choice:
But they shouldn't, IMO.
I agree. I recommend deprecating this feature and raising a DeprecationWarning if a Unicode literal contains non-ASCII characters but no encoding has been declared.
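For illustration, a minimal sketch of such a check (the helper name and calling convention are made up; only the rule itself is from the proposal):

    import warnings

    def check_unicode_literal(literal_bytes, declared_encoding):
        # Hypothetical compile-time check: warn if a Unicode literal
        # contains non-ASCII bytes and no source encoding was declared.
        if declared_encoding is None and any(b > 127 for b in literal_bytes):
            warnings.warn("non-ASCII characters in a Unicode literal "
                          "without an encoding declaration",
                          DeprecationWarning)

    check_unicode_literal(b"caf\xe9", None)        # warns
    check_unicode_literal(b"caf\xe9", "latin-1")   # silent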
Sorry, I don't understand what you're trying to say here. Can you explain this with an example? Why can't we require any program encoded in more than pure ASCII to have an encoding magic comment? I guess I don't understand what you mean by "raw binary".
With the proposed implementation, the encoding declaration is only used for Unicode literals. In all other places where non-ASCII characters can occur (comments, string literals), those characters are treated as "bytes", i.e. it is not verified that these bytes are meaningful under the declared encoding.
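To make that concrete, here is a rough sketch of the phase-1 rule (the token kinds and the function are invented for illustration):

    import codecs

    def handle_token(kind, raw, declared_encoding):
        # Phase-1 rule (sketch): only Unicode literals are decoded
        # using the declared encoding; comments and plain string
        # literals pass through as opaque, unverified bytes.
        if kind == "unicode-literal":
            return codecs.decode(raw, declared_encoding)
        return raw

    handle_token("unicode-literal", b"caf\xe9", "latin-1")  # u"café"
    handle_token("string-literal", b"caf\xe9", "latin-1")   # raw bytes, unchecked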
Marc's original proposal was to apply the declared encoding to the complete source code, but I objected, claiming that it would make the tokenizer changes more complex and the resulting tokenizer likely significantly slower (at least if the codecs API is used to perform the decoding).
I don't think that the codecs will significantly slow down overall compilation -- the compiler is not fast to begin with.
However, changing the base type in the tokenizer and compiler from char to Py_UNICODE will be a significant effort, which is why I added two phases to the implementation.
The first phase will only touch Unicode literals as proposed by Martin.
In phase 2, the encoding will apply to all strings. So it will not be possible to put arbitrary byte sequences in a string literal, at least if the encoding disallows certain byte sequences (as UTF-8 and ASCII do). Since this is currently possible, we have a backwards compatibility problem.
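The failure mode is easy to demonstrate with the codecs we already have: bytes that are perfectly good Latin-1 need not be valid UTF-8, so a phase-2 compile under a UTF-8 (or ASCII) source encoding would reject source files that are accepted today.

    raw = b"gr\xfc\xdfe"      # u"grüße" encoded as Latin-1
    raw.decode("latin-1")     # works: u"grüße"
    raw.decode("utf-8")       # raises UnicodeDecodeError: invalid start byte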
Right, and I believe that a lot of people in European countries write string literals with a Latin-1 encoding in mind. We cannot simply break all that code.
The other problem is with comments found in Python source code. In phase 2 these will break as well.
So how about this:
In phase 1, the tokenizer checks the complete file for non-ASCII characters and outputs a single warning per file if it doesn't find a coding declaration at the top. Unicode literals continue to use [raw-]unicode-escape as the codec.
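In code, the phase-1 check could look something like this (the file handling and warning text are mine; the declaration pattern follows the proposed coding-comment syntax):

    import re
    import warnings

    CODING_RE = re.compile(br"coding[:=]\s*([-\w.]+)")

    def check_source_file(path):
        # Phase-1 sketch: emit a single warning per file when non-ASCII
        # bytes appear without a coding declaration in the first two lines.
        with open(path, "rb") as f:
            data = f.read()
        head = b"\n".join(data.splitlines()[:2])
        if CODING_RE.search(head):
            return                    # encoding declared, nothing to check
        if any(b > 127 for b in data):
            warnings.warn("%s: non-ASCII characters but no encoding "
                          "declaration" % path)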
In phase 2, we enforce ASCII as the default encoding, i.e. the warning turns into an error. The [raw-]unicode-escape codec will be extended to also support converting Unicode to Unicode, that is, it will only handle escape sequences in this case.
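As a toy illustration of that Unicode-to-Unicode mode (it handles only \uXXXX escapes; the real codec would have to cover the full escape repertoire):

    import re

    def unicode_escape_u2u(text):
        # Sketch of the extended codec: the input is already Unicode,
        # so only escape sequences are interpreted; everything else
        # passes through unchanged.
        return re.sub(r"\\u([0-9a-fA-F]{4})",
                      lambda m: chr(int(m.group(1), 16)),
                      text)

    unicode_escape_u2u(u"gr\\u00fc\\u00dfe")   # -> u"grüße"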
-- Marc-Andre Lemburg CEO eGenix.com Software GmbH