[Python-Dev] uPEP: encoding directive

Wed, 19 Jul 2000 00:52:11 +0200

> [paul]
> > Also, is it really necessary to allow raw non-ASCII characters in
> > source code though? We know that they aren't portable across
> > editing environments, so one person's happy face will be another
> > person's left double-dagger.
>
> [me]
> I suppose changing that would break code.  maybe it's time
> to reopen the "pragma encoding" thread?
>
> (I'll dig up my old proposal, and post it under a new subject).

as brief as I can make it:

1. add support for "compiler directives".  I suggest the following
syntax, loosely based on XML:

    #?python key=value [, key=value ...]

(note that "#?python" will be treated as a token after this change.
if someone happens to use comments that start with #?python,
they'll get a "SyntaxError: bad #?python compiler directive"...)

2. for now, only accept compiler directives if they appear before
the first "real" statement.

3. keys are python identifiers (NAME tokens), values are simple
literals (STRING, NUMBER)

4. key/value pairs are collected in a dictionary.
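
a rough sketch of points 1-4, written in Python rather than in the
tokenizer itself.  the function name, the regular expression, and the
choice to silently stop at the first "real" statement are my
assumptions for illustration only; the actual patch would work on
tokens, not on lines:

```python
import re

DIRECTIVE_PREFIX = "#?python"

# NAME = STRING-or-NUMBER; a simplified stand-in for the real token rules
_PAIR = re.compile(r"""\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*("[^"]*"|'[^']*'|\d+)\s*""")


def parse_directives(source):
    """Collect key/value pairs from leading #?python lines into a dict."""
    directives = {}
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith(DIRECTIVE_PREFIX):
            rest = stripped[len(DIRECTIVE_PREFIX):]
            pos = 0
            while True:
                match = _PAIR.match(rest, pos)
                if match is None:
                    raise SyntaxError("bad #?python compiler directive")
                key, value = match.group(1), match.group(2)
                # NUMBER stays a number, STRING loses its quotes
                directives[key] = int(value) if value.isdigit() else value[1:-1]
                pos = match.end()
                if pos == len(rest):
                    break
                if rest[pos] != ",":
                    raise SyntaxError("bad #?python compiler directive")
                pos += 1
        elif stripped and not stripped.startswith("#"):
            # first "real" statement: stop looking (assumption: later
            # directives are ignored here rather than rejected)
            break
    return directives
```

with this sketch, a file starting with `#?python encoding="iso-8859-1"`
yields a dictionary with a single "encoding" key, and an ordinary
comment that happens to start with #?python raises the SyntaxError
mentioned under point 1.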

5. for now, we only support the "encoding" key.  it is used to
determine how string literals (STRING tokens) are converted
to string or unicode string objects.

6. the encoding value can be any of:

"undefined" or not defined at all:

    plain string: copy source characters as is

    unicode string: expand 8-bit source characters to
    unicode characters (i.e. treat them as ISO Latin 1)

"ascii":

    plain string: characters in the 128-255 range give
    a SyntaxError (illegal character in string literal).

    unicode string: same as for plain string

any other ascii-compatible encoding (the ISO 8859 series,
Mac Roman, UTF-8, and others):

    plain string: characters in the 128-255 range give
    a SyntaxError (illegal character in string literal).

    unicode string: characters in the 128-255 range are
    decoded, according to the given encoding.

any other encoding (UCS-2, UTF-16):

    undefined (or SyntaxError: illegal encoding)

to be able to flag this as a SyntaxError, I assume we can
add an "ASCII compatible" flag to the encoding files.

7. only the contents of string literals can be encoded.  the
tokenizer still works on 7-bit ASCII (hopefully, this will change
in future versions).

8. encoded string literals are decoded before Python looks
for backslash escape codes.
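
to make the rules in point 6 concrete, here is a minimal sketch in
present-day Python, where a "plain string" is modelled as bytes and a
"unicode string" as str.  the function name, the ASCII_COMPATIBLE set
(a stand-in for the proposed "ASCII compatible" flag), and the handling
of malformed input are assumptions of mine, not part of the proposal:

```python
# hypothetical stand-in for the proposed "ASCII compatible" codec flag
ASCII_COMPATIBLE = {"iso-8859-1", "iso-8859-15", "utf-8", "mac-roman"}


def convert_literal(raw, encoding=None, unicode_literal=False):
    """Convert the raw bytes of a string literal per the rules in point 6."""
    has_8bit = any(byte >= 128 for byte in raw)

    if encoding is None or encoding == "undefined":
        # plain: copy as is; unicode: treat 8-bit characters as ISO Latin 1
        return raw.decode("iso-8859-1") if unicode_literal else raw

    name = encoding.lower()
    if name != "ascii" and name not in ASCII_COMPATIBLE:
        # e.g. UCS-2 or UTF-16
        raise SyntaxError("illegal encoding")

    if name == "ascii" or not unicode_literal:
        # "ascii" rejects 8-bit characters everywhere; other encodings
        # reject them in plain strings only
        if has_8bit:
            raise SyntaxError("illegal character in string literal")
        return raw.decode("ascii") if unicode_literal else raw

    try:
        return raw.decode(name)
    except UnicodeDecodeError:
        # assumption: malformed input is reported like any illegal character
        raise SyntaxError("illegal character in string literal") from None
```

backslash escapes are not handled here; per point 8 they would be
processed on the value this function returns, after decoding.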

I think that's all.

Comments?  I've looked at the current implementation rather
carefully, and it shouldn't be that hard to come up with patches
that implement this scheme.

</F>