[Python-3000] Support for PEP 3131

Rauli Ruohonen rauli.ruohonen at gmail.com
Sat Jun 2 18:19:14 CEST 2007


On 6/2/07, Josiah Carlson <jcarlson at uci.edu> wrote:
> """
> If a comment in the first or second line of the Python script matches
> the regular expression coding[=:]\s*([-\w.]+), this comment is processed
> as an encoding declaration; the first group of this expression names the
> encoding of the source code file.
> """
>
> Your suggestion would unnecessarily change the semantics of the encoding
> declarations.  I would call this gratuitous breakage.

Depending on what the regular expression for the declarations is, the
difference may not be big. Current code can also be reliably converted with
an automated tool, so this isn't a big deal for py3k.
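As a concrete illustration of the rule quoted above, here is a minimal
sketch of the PEP 263 declaration scan (simplified: the real tokenizer also
requires the match to occur inside a comment and handles BOMs):

```python
import re

# The encoding-declaration pattern quoted from PEP 263 above.
CODING_RE = re.compile(r"coding[=:]\s*([-\w.]+)")

def detect_coding(lines):
    """Return the encoding named in the first or second line, or None
    when no declaration is present. Simplified sketch: the real
    tokenizer only honours the pattern when it appears in a comment."""
    for line in lines[:2]:
        match = CODING_RE.search(line)
        if match:
            return match.group(1)
    return None

print(detect_coding(["# -*- coding: iso-8859-1 -*-", "x = 1"]))  # iso-8859-1
```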

It may be that the change is unnecessary. Reading Guido's writings, he seems
to be of the opinion that the Java way (no restrictions at all) is right
here, and anything else can be delegated to pylint and similar tools.

> Sounds like the application of vim settings as a solution to a whole
> bunch of completely unrelated "problems" in Python (especially with 4
> space indents being the "one true way to indent" and the encoding
> declaration already being established).  Please keep your vim out of my
> Python ;) .

The encoding declaration stays mostly the same; I'm just suggesting adding
similar declarations for the identifier/string character sets and making them
deception-proof. You're probably right about the indentation stuff. If you
got rid of all indentation-related options and simply forbade mixing tabs
and spaces, I'd just say good riddance.

> And as stated by basically everyone, the only *sane* default is ascii
> identifiers.  Since the vast majority of users will have no use for
> unicode identifiers in the short or long term, making them the default
> is overzealous at best.

"Basically everyone" is not true, because it does not include Guido, who
matters the most. Some quotes from his latest posts on the topic:

Guido van Rossum (May 25):
:I still think such a command-line switch (or switches) is the wrong
:approach. What if I have *one* module that uses Cyrillic legitimately.
:A command-line switch would enable Cyrillic in *all* modules.

Guido van Rossum (May 25):
:On 5/24/07, Josiah Carlson <jcarlson at uci.edu> wrote:
:> Where else in Python have we made the default
:> behavior only desired or useful to 5% of our users?
:
:Where are you getting that statistic? This seems an extremely
:backwards, US-centric worldview.

Guido van Rossum (May 25):
:A more useful approach would seem to be a set of auditing tools that
:can be applied routinely to all new contributions (e.g. as a
:pre-commit hook when using a source control system), or to all code in
:a given directory, download, etc. I don't see this as all that
:different from using e.g. PyChecker or PyLint.
:
:While I routinely perform visual code inspections [...], I certainly don't see
:this as a security audit [...]. Scanning for stray non-ASCII characters is best
:left to automated tools.

Guido van Rossum (May 23):
:In particular very helpful was a couple of reports from the Java
:world, where Unicode letters in identifiers have been legal for a long
:time now. (JavaScript also supports this BTW.) The Java world has not
:fallen apart,

Guido van Rossum (May 17):
:As I mentioned before, I don't expect either of these will be much of
:a concern. I guess tools like pylint could optionally warn if
:non-ascii characters are used.
:
:On 5/16/07, Jim Jewett <jimjjewett at gmail.com> wrote:
:> (1)  Security concerns.
:> (2)  Obscure bugs.

Summary of what I think Guido's saying (involves some interpretation):
 - always having no restrictions (the Java way) is not a problem in practice
 - because having no restrictions has worked well with Java, Python should
   follow suit
 - any concerns can be dealt with adequately by external tools alone
 - command line switches are a bad implementation of restriction management

It is the last one of these that I was addressing, as there was some demand
for restriction management (despite Guido's leave-it-to-pylint stance) but no
adequate proposal. The defaults are easily changed in any case.
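To illustrate the kind of external auditing tool meant here, a stray
non-ASCII scan might look roughly like this (the function name and
interface are invented for illustration; this is not an existing pylint
check):

```python
def find_non_ascii(lines):
    """Return (line number, column, character) for every non-ASCII
    character in an iterable of source lines -- a crude version of
    the stray-character scan Guido suggests leaving to tools."""
    hits = []
    for lineno, line in enumerate(lines, 1):
        for col, ch in enumerate(line):
            if ord(ch) > 0x7F:
                hits.append((lineno, col, ch))
    return hits

print(find_non_ascii(["x = 1\n", "абв = 2\n"]))
```

A tool like this would run as a pre-commit hook or over a directory, in
line with Guido's suggestion above.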

> > # identifier_charset: fooproject.codingstyle.identifier_charset
>
> I really don't like the idea of adding a *different* import-like thing.
> We already have imports (that are evaluated at run time, not compile
> time), and due to their semantics, can't use a mechanism like the above.

I agree that import is problematic. This part could be omitted with the
rationale that it's more trouble than it's worth, and anyone who needs
something complicated can use pylint or similar. In the end, something like
this is what you'd have most of the time in practice when you care about
character sets:

# identifier_charset: 0-7f

# Real code.

When you have a file with Cyrillic, then it'd allow Cyrillic too. For quick
hacks you could use this and everything would just work:

#!/usr/bin/env python

# Real code.

This isn't really anything more than a countermeasure against Ka-Ping's
tricky.py exploit and the addition of a real charset restriction method
instead of abusing the coding declaration for that (which would force you
to use legacy codings just to restrict the charsets, as pointed out much
earlier here).
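A checker for the proposed directive could be sketched roughly as follows;
the `0-7f` hex-range syntax is taken from the example above, and everything
else (names, behaviour when no directive is present) is my assumption:

```python
import io
import re
import tokenize

# Matches the proposed "# identifier_charset: 0-7f" directive as a hex
# code point range; this syntax is an assumption, not an accepted spec.
CHARSET_RE = re.compile(r"#\s*identifier_charset:\s*([0-9a-fA-F]+)-([0-9a-fA-F]+)")

def check_identifiers(source):
    """Return identifiers containing characters outside the declared
    code point range; with no directive, everything is allowed."""
    match = CHARSET_RE.search(source)
    if not match:
        return []
    lo, hi = (int(group, 16) for group in match.groups())
    offenders = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and any(
                not lo <= ord(ch) <= hi for ch in tok.string):
            offenders.append(tok.string)
    return offenders

src = "# identifier_charset: 0-7f\nabc = 1\nабв = 2\n"
print(check_identifiers(src))  # -> ['абв']
```

Since the directive is a plain comment matched byte-for-byte, it can't be
hidden by a deceptive coding declaration the way tricky.py hides code.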

One more thing which might be removed from the suggestion is the command
line option and its associated site.py default. Such checking is more
appropriate for pylint, and is probably of little use anyway. Either you
trust the files you're importing, in which case the characters they use
make no difference, or you don't, in which case you shouldn't be importing
them at all and checking their character sets will not help you. For audit
purposes the comment directives are enough, as they can't deceive, and if
you want to be extra paranoid you can use pylint to catch any surreptitious
patches like the one in Guillaume's post.

