[Python-3000] PEP 3131 accepted
python at zesty.ca
Sun May 27 03:19:50 CEST 2007
On Sat, 26 May 2007, Michael Urman wrote:
> On 5/26/07, Ka-Ping Yee <python at zesty.ca> wrote:
> > But the enabling of UTF-8 by a BOM at the
> > beginning of the file is an invisible override. This invisible
> > override is the source of the danger. If we want to be able to
> > read the coding declaration with any confidence, we should get rid
> > of the invisible override.
> Do we need to reconsider PEP 3120 "Using UTF-8 as the default source
> encoding"? I don't see much difference between not knowing on visual
> inspection whether:
> allowed is allowed
> "allowed" == "allowed"
The concern is similar in nature, but there is a difference. It is
more feasible to tell programmers not to trust the visual appearance
of strings than to tell them not to trust the visual appearance of
identifiers. Strings are data, which makes them separable from the
structure and logic of a program, whereas identifiers are fundamental
to all programs. Programmers are already trained to understand that
string literals in source code are non-verbatim representations (e.g.
"it's" == 'it\'s' == 'it' "'s" == "\x69t's"), whereas they have a well
established expectation that identifiers are written verbatim.
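
A minimal sketch of the point above. The Cyrillic letter is an assumed
stand-in for whatever confusable character the quoted example used; it
is written as an escape so the difference stays visible here.

```python
# Four spellings of one string value: the source text differs, the data does not.
spellings = ["it's", 'it\'s', 'it' "'s", "\x69t's"]
assert all(s == "it's" for s in spellings)

# The reverse case: two literals that can render alike but hold different data.
latin = "allowed"
mixed = "\u0430llowed"   # first letter is U+0430 CYRILLIC SMALL LETTER A (assumed example)
print(latin == mixed)    # False
print(ascii(mixed))      # '\u0430llowed'
```

Printing with ascii() is one way to see what a string really contains;
nothing comparable exists for an identifier short of inspecting the
source bytes.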
As long as you have a way of distinguishing strings reliably from the
rest of the source code, you can know whether your confidence is well
placed. Blake's example illustrates that ambiguity in strings is
especially dangerous because it can obscure where strings begin and end.
PEP 3120 is problematic. At the very least, it is definitely missing
a section addressing objections (the problem of not being able to
understand an expression like "allowed" == "allowed") and a section
on security considerations (like those raised by Blake's example).
Since the default encoding is currently ASCII, most Python
programmers are unlikely to be prepared for ambiguity in strings;
thus the best thing to do would be to keep the default as ASCII and
require a visible declaration to activate such ambiguity (enable UTF-8).
Failing that, the next best thing to do would be to forbid all
confusable characters without an explicit declaration to permit them.
And the next best thing after that would be to forbid just the
characters that are confusable with the delimiters that fence off
ambiguous text (' " #) without an explicit declaration to permit them.
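
The first and third options above could be sketched roughly as
follows. This is only an illustration, not a proposed implementation:
the function names are hypothetical, the declaration pattern is the
one given in PEP 263, and the set of delimiter lookalikes is a small
hand-picked sample, not a complete confusables table. The second
option would need such a table and is not attempted here.

```python
import re

# Coding declaration pattern from PEP 263; only lines 1 and 2 are examined.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

# Hand-picked lookalikes of ' " and # (illustrative, not exhaustive).
DELIMITER_LOOKALIKES = {
    "\u2018", "\u2019", "\u02bc", "\u2032", "\uff07",   # resemble '
    "\u201c", "\u201d", "\u2033", "\uff02",             # resemble "
    "\uff03", "\ufe5f",                                 # resemble #
}

def has_coding_declaration(source):
    """True if either of the first two lines carries a visible declaration."""
    return any(CODING_RE.match(line) for line in source.splitlines()[:2])

def _scan(source, is_suspect):
    """Return (line, column, char) for each suspect character, 1-based."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if is_suspect(ch):
                hits.append((lineno, col, ch))
    return hits

def undeclared_non_ascii(source):
    """First option: any non-ASCII character requires a declaration."""
    if has_coding_declaration(source):
        return []
    return _scan(source, lambda ch: ord(ch) > 127)

def undeclared_delimiter_lookalikes(source):
    """Third option: only lookalikes of ' " # require a declaration."""
    if has_coding_declaration(source):
        return []
    return _scan(source, lambda ch: ch in DELIMITER_LOOKALIKES)
```

For example, a line such as "x = 1 \uff03 not a comment" is flagged
by both checks when the file has no declaration, and by neither once
a line like "# -*- coding: utf-8 -*-" is added at the top.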