[Python-3000] Support for PEP 3131

Sat Jun 2 06:14:21 CEST 2007

On 5/27/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
>James Y Knight writes:
>> a 'pyidchar.txt' file with a list of character ranges, and now that
>> pyidchar.txt file is going to have separate sections based on module
>> name? Sorry, but are you !@# kidding me?!?
>
>The scalability issue was raised by Guido, not the ASCII advocates.

He did not say that such files or command-line options would be
scalable either. They are fine tools for auditing, but not for
everyday use of finished products. We should provide both auditing
tools and easy use of code that has already been audited.

One possibility for providing both:

(1) Add a mandatory ASCII-only special comment at the beginning of
    each module. The comment would continue until the first empty
    line and would contain only valid directives matching some
    regular expression. Only whitespace is allowed before the
    comment. Anything else is a syntax error.
(2) Allow directives in the special comment to change encoding and
    tab/space rules. Also allow them to restrict the identifier
    character set and the string character set.
(3) Defaults: utf-8 encoding, no mixed tabs and spaces, and no
    restriction on identifier or string content (beyond the
    restrictions in PEP 3131 etc., which the user can't lift, of
    course). One could change these defaults in site.py, but the
    directives in (2) override them, so they can't be used for B&D.
(4) Have a command line parameter for restricting the character sets
    of all modules. Every module must satisfy both this and its own
    directives simultaneously. A default value for this could be set
    in site.py, but it must be immutable after first assignment.
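
To make (1) and (2) a bit more concrete, here is a rough sketch of how
the special comment might be read. It is only an illustration: the
function name read_header_directives, the two regular expressions, the
handling of the #! line and of a bare "#", and the rule that a line
indented after the "#" continues the previous directive are all my own
assumptions, not part of the proposal.

```python
import re

# Hypothetical directive grammar: "# name: value". A line with two or
# more spaces after "#" is treated as continuing the previous directive.
DIRECTIVE = re.compile(r"#\s?([a-z_]+):\s*(.*)")
CONTINUATION = re.compile(r"#\s{2,}(\S.*)")


def read_header_directives(source):
    """Return the directives found in the special comment as a dict."""
    directives = {}
    current = None      # name of the directive that may be continued
    started = False     # True once the first comment line was seen
    for lineno, line in enumerate(source.splitlines(), 1):
        text = line.strip()
        if not text:
            if started:
                break       # the first empty line ends the comment
            continue        # only whitespace before the comment
        if not text.startswith("#"):
            raise SyntaxError("line %d: code inside the header" % lineno)
        first = not started
        started = True
        if text == "#" or (first and text.startswith("#!")):
            current = None  # assumed: shebang and bare "#" are ignored
            continue
        match = DIRECTIVE.fullmatch(text)
        if match:
            current = match.group(1)
            directives[current] = match.group(2).strip()
            continue
        match = CONTINUATION.fullmatch(text)
        if match and current is not None:
            joined = directives[current] + " " + match.group(1)
            directives[current] = joined.strip()
            continue
        raise SyntaxError("line %d: not a directive: %r" % (lineno, text))
    return directives


example = (
    "#!/usr/bin/env python\n"
    "#\n"
    "# coding: utf-8\n"
    "# indentation: unmixed\n"
    "\n"
    "# Real code.\n"
)
print(read_header_directives(example))
```

Directives that are missing from the returned dict would simply fall
back to the defaults described in (3).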

This way everything "just works" for quick hacks and for naive users
who only run code they trust. For real projects it's easy to add a couple
of lines in modules to enforce project policy. When you see code
that doesn't specify a character set you trust, then you know you
may have to be careful.

If you don't want to be careful, then you can set the command line
parameter default to e.g. ascii in site.py and nothing using
non-ascii identifiers will work for you. If you're fine with
explicit charset tags but not implicit ones, then you can set the
defaults for tagless modules to ascii in site.py.
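
A minimal sketch of how the interaction in (4) could work, under my
own assumptions: CharsetPolicy, set_global and permits are made-up
names, and character sets are modelled as lists of inclusive
(low, high) code point pairs. The point is only that the global
restriction can be assigned once and that an identifier has to pass
both the global and the per-module check.

```python
class CharsetPolicy:
    """Process-wide restriction that can be assigned only once."""

    def __init__(self):
        self._global_ranges = None  # None means "not restricted"

    def set_global(self, ranges):
        # Immutable after first assignment, as required in (4).
        if self._global_ranges is not None:
            raise RuntimeError("global restriction is already set")
        self._global_ranges = tuple(ranges)

    def permits(self, identifier, module_ranges=None):
        # The identifier must satisfy the global restriction and the
        # module's own directive simultaneously.
        for ranges in (self._global_ranges, module_ranges):
            if ranges is None:
                continue
            for ch in identifier:
                if not any(lo <= ord(ch) <= hi for lo, hi in ranges):
                    return False
        return True


policy = CharsetPolicy()
policy.set_global([(0x00, 0x7F)])  # e.g. what site.py might do for ascii
print(policy.permits("naive"))
print(policy.permits("na\u00efve"))
```

With the global set to ascii, a module that declares a wider
identifier_charset still cannot use non-ascii identifiers, which is
the behaviour described above.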

Example 1 (the defaults, implicit):

#!/usr/bin/env python

# Real code starts here. This comment is not special and you
# can even usë whätëvër chäräctërs yöü wänt tö hërë.

Example 2 (the defaults, explicit):

#!/usr/bin/env python
#
# coding: utf-8
# identifier_charset: 0-10ffff
# string_charset: 0-10ffff
# indentation: unmixed

# Real code.

Example 3 (strawman for some Japanese code):

# identifier_charset:
#     0-7f 3b1-3c9 "Hiragana" "Katakana" "CJK Unified Ideographs"
#     "CJK Unified Ideographs Extension A"
#     "CJK Unified Ideographs Extension B"

# The range 3b1-3c9 is lowercase Greek, which is often used in math.
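
For illustration, a charset value like the one above could be turned
into code point ranges roughly as follows. This is a sketch, not a
specification: parse_charset and is_allowed are hypothetical names,
and the BLOCKS table is a hand-written excerpt of Unicode block
ranges, since the standard library has no lookup of blocks by name.

```python
import re

# Hand-written excerpt of Unicode block ranges (inclusive).
BLOCKS = {
    "Hiragana": (0x3040, 0x309F),
    "Katakana": (0x30A0, 0x30FF),
    "CJK Unified Ideographs": (0x4E00, 0x9FFF),
    "CJK Unified Ideographs Extension A": (0x3400, 0x4DBF),
    "CJK Unified Ideographs Extension B": (0x20000, 0x2A6DF),
}


def parse_charset(spec):
    """Parse hex ranges and quoted block names into (low, high) pairs."""
    ranges = []
    for token in re.findall(r'"[^"]*"|\S+', spec):
        if token.startswith('"'):
            ranges.append(BLOCKS[token.strip('"')])
        else:
            low, high = token.split("-")
            ranges.append((int(low, 16), int(high, 16)))
    return ranges


def is_allowed(identifier, ranges):
    """True if every character of the identifier is in some range."""
    return all(
        any(low <= ord(ch) <= high for low, high in ranges)
        for ch in identifier
    )


ranges = parse_charset('0-7f 3b1-3c9 "Hiragana" "Katakana"')
print(is_allowed("alpha_1", ranges))
print(is_allowed("\u03b1", ranges))   # lowercase Greek alpha
print(is_allowed("\u03a9", ranges))   # uppercase Greek omega
```

An unknown block name raises KeyError here; a real implementation
would presumably report that as a syntax error in the directive.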

Example 4 (inclusion from a file, similar to import):

# identifier_charset: fooproject.codingstyle.identifier_charset
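
One way such a reference might be resolved, assuming it is read as
"attribute identifier_charset of the module fooproject.codingstyle".
That reading and the function name resolve_charset_reference are my
assumptions; the sketch only uses importlib from the standard library.

```python
import importlib


def resolve_charset_reference(dotted_name):
    """Import the module part of the name and return the named attribute."""
    module_name, _, attribute = dotted_name.rpartition(".")
    if not module_name:
        raise ValueError("expected module.attribute, got %r" % dotted_name)
    module = importlib.import_module(module_name)
    return getattr(module, attribute)
```

A project could then keep its policy in one place and have every
module refer to it instead of repeating the ranges.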