Guido van Rossum firstname.lastname@example.org writes:
This makes Latin-1 the right choice:
But they shouldn't, IMO.
I agree. I recommend deprecating this feature and raising a DeprecationWarning if a Unicode literal contains non-ASCII characters but no encoding has been declared.
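To make the suggested check concrete, here is a rough sketch of what it could look like outside the tokenizer. It is only an illustration: the function names are made up, the declaration is assumed to be an Emacs-style "coding:" comment on the first or second line, and Unicode literals are found with a crude regular expression rather than a real tokenizer (no triple-quoted strings, no awareness of comments).

```python
import re
import warnings

# Assumed declaration form: a comment on line 1 or 2 containing
# "coding:" or "coding=" followed by the encoding name.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")

# Crude pattern for single-line u"..." / u'...' literals; not a real tokenizer.
UNICODE_LITERAL_RE = re.compile(
    rb"""[uU][rR]?(?:"(?:[^"\\\n]|\\.)*"|'(?:[^'\\\n]|\\.)*')"""
)


def declared_encoding(source):
    """Return the declared encoding name, or None if there is no declaration."""
    for line in source.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None


def check_unicode_literals(source, filename="<string>"):
    """Warn if a Unicode literal holds non-ASCII bytes and no encoding is declared.

    Returns the list of offending literals (as bytes).
    """
    if declared_encoding(source) is not None:
        return []
    offending = [
        match.group(0)
        for match in UNICODE_LITERAL_RE.finditer(source)
        if any(byte > 0x7F for byte in match.group(0))
    ]
    if offending:
        warnings.warn(
            f"{filename}: non-ASCII character in Unicode literal, "
            "but no encoding declared",
            DeprecationWarning,
            stacklevel=2,
        )
    return offending
```

Under this sketch, a plain string literal with non-ASCII bytes is left alone; only Unicode literals in a file without a declaration trigger the warning.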
Sorry, I don't understand what you're trying to say here. Can you explain this with an example? Why can't we require any program encoded in more than pure ASCII to have an encoding magic comment? I guess I don't understand what you mean by "raw binary".
With the proposed implementation, the encoding declaration is only used for Unicode literals. In all other places where non-ASCII characters can occur (comments, string literals), those characters are treated as "bytes", i.e. it is not verified that these bytes are meaningful under the declared encoding.
Marc's original proposal was to apply the declared encoding to the complete source code, but I objected, claiming that it would make the tokenizer changes more complex and the resulting tokenizer likely significantly slower (at least if you use the codecs API to perform the decoding).
In phase 2, the encoding will apply to all strings. It will then no longer be possible to put arbitrary byte sequences in a string literal, at least if the declared encoding disallows certain byte sequences (as UTF-8 and ASCII do). Since this is currently possible, we have a backwards-compatibility problem.
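As a small illustration of the compatibility problem, the following sketch checks whether a byte sequence is valid under a given encoding. The helper name and the sample bytes are made up for the example; the point is only that ASCII and UTF-8 reject some byte sequences, while Latin-1 assigns a character to every byte value.

```python
def decodes_under(data, encoding):
    """Return True if the bytes form a valid sequence in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Arbitrary binary data of the kind people embed in string literals today.
payload = b"\x89PNG\r\n\xff\xfe"

for encoding in ("ascii", "utf-8", "latin-1"):
    print(encoding, decodes_under(payload, encoding))
```

A file that embeds such bytes in a string literal would be rejected in phase 2 if it declared ASCII or UTF-8, but would still be accepted with a Latin-1 declaration.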