PEP 263 comments

Tue Feb 26 05:06:45 EST 2002

"Martin v. Loewis" wrote:
> 
> To make some progress on PEP 263, I suggest that some of the open issues
> are resolved as follows:

Thanks for the comments. I've updated the PEP at SourceForge...

> - Comment syntax: I suggest to use the form
>   -*- coding: <coding name> -*-
>   Emacs already recognizes this syntax, as does patch #508973
>   on IDLEfork. The other proposed syntaxes should be removed from the
>   PEP.

+1
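
For illustration, recognizing that comment form could look roughly like
the sketch below. The regular expression, the function name and the
choice to inspect only the first two lines are assumptions made for this
example; none of them is fixed by the discussion above.

```python
import re

# Hypothetical pattern for the "-*- coding: <coding name> -*-" form.
# The exact expression is an assumption, not text taken from the PEP.
CODING_RE = re.compile(rb"-\*-\s*coding:\s*([-\w.]+)\s*-\*-")


def find_declared_coding(lines, max_lines=2):
    """Return the declared coding name, or None if no comment is found.

    `lines` is a list of byte strings. Only comment lines among the
    first `max_lines` lines are examined (an assumption of this sketch).
    """
    for line in lines[:max_lines]:
        if not line.lstrip().startswith(b"#"):
            continue
        match = CODING_RE.search(line)
        if match:
            return match.group(1).decode("ascii")
    return None
```

A file whose second line reads "# -*- coding: latin-1 -*-" would yield
"latin-1" here; a file without such a comment yields None.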

> - In addition, to simplify usage on Windows, Python recognizes the
>   UTF-8 file signature (e.g. as generated by notepad). Any file
>   starting with \xef\xbb\xbf is treated as being UTF-8; a coding
>   comment different from "utf-8" in such a file is an error.

+1
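
A minimal sketch of the signature rule, assuming the coding comment has
already been extracted by some other step. The function name and the use
of SyntaxError are choices made for this example only.

```python
import codecs


def resolve_source_encoding(raw, declared=None):
    """Return (encoding, payload) for the raw bytes of a source file.

    `declared` is the name from the coding comment, or None. A file
    starting with the UTF-8 signature is treated as UTF-8, and a coding
    comment naming anything else is reported as an error.
    """
    if raw.startswith(codecs.BOM_UTF8):
        # codecs.lookup() normalizes spellings such as "UTF8" or "utf_8".
        if declared is not None and codecs.lookup(declared).name != "utf-8":
            raise SyntaxError(
                "coding comment %r conflicts with the UTF-8 signature" % declared
            )
        return "utf-8", raw[len(codecs.BOM_UTF8):]
    return declared, raw
```

The signature bytes are stripped from the payload so that later stages
never see them.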

> - identifiers remain restricted to ASCII

+1

> - Implementation strategy: I believe the proposed strategy (change the
>   tokenizer) is overly complicated, and likely inefficient. Instead, I
>   suggest that the encoding directive applies only to Unicode literals.
>   It will still be formally an error if comments or string literals do
>   not follow the declared encoding, but the Python parser won't detect
>   this error.
> 
>   For use in Unicode literals, the parser will continue to work as it
>   does now, except that it applies the declared coding in compile.c.
>   To do so, PyUnicode_DecodeRawUnicodeEscape and
>   PyUnicode_DecodeUnicodeEscape will expect an additional flag
>   indicating whether they operate on a char* or a Py_UNICODE*.
> 
>   The only problem with this approach is that encodings where " or '
>   could be the second byte of a multi-byte character cannot be
>   supported as a source encoding. Python supports no such encoding
>   in the standard library at the moment, anyway, so this should not
>   be a problem.

I've added a two-phase approach to the PEP: first we only
handle Unicode literals, then we do the whole file in a later
step.
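
The limitation mentioned in the quoted text (a quote character appearing
as the second byte of a multi-byte character) can be illustrated with a
byte-level scan. This is only a sketch: the scanner function is
hypothetical, and it uses the iso2022_jp codec shipped with current
Python versions as an example of such an encoding, which is an
assumption about the reader's environment rather than something
available at the time of this message.

```python
def naive_literal_end(data, start, quote=b'"'):
    """Return the index of the first quote byte at or after `start`.

    This mimics a scanner that looks for the closing quote in the raw
    bytes before any decoding has taken place.
    """
    return data.find(quote, start)


# U+3042 HIRAGANA LETTER A; its ISO-2022-JP form contains the byte 0x22.
body = "\u3042".encode("iso2022_jp")
source = b'u"' + body + b'"'

# The scan stops inside the encoded character, not at the real closing
# quote, which is the last byte of `source`.
end = naive_literal_end(source, 2)
stops_too_early = end < len(source) - 1
```

With an ASCII-compatible encoding such as Latin-1 or UTF-8 the same scan
always stops at the real closing quote, which is why phase one can apply
the declared coding to the literal body only.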

> - Backwards compatibility: I'm in favour of leaving mostly everything
>   as-is, i.e. if there is no declared encoding, it should be possible
>   to put arbitrary bytes in string literals and comments; the proposed
>   implementation strategy supports that. However, I think that Unicode
>   literals which use the Latin-1 fallback should be deprecated, and that
>   the implementation should raise a DeprecationWarning: Anybody relying
>   on that feature should declare that the encoding is Latin-1.

Python will have to use Latin-1 as fallback encoding anyway,
so I don't think it's worth the trouble...

> - Changes to IDLE: When IDLE opens a file, it shall look for the UTF-8
>   signature. If no UTF-8 signature is found, it shall look for the
>   coding comment. If none is found, it shall apply the locale's
>   coding, which is determined as follows:
>   - on Windows, it is "mbcs"
>   - on Unix, it is the one returned by nl_langinfo(CODESET)
>   Otherwise, it is the system default encoding.
> 
>   When saving a file, IDLE shall preserve the UTF-8 signature if there
>   was one. If not, and if there is a coding comment, that should be
>   used to encode the file. If there is none, the locale's encoding
>   should be used. If encoding fails (whether the coding was found in
>   the comment or in the locale), the file shall be UTF-8 encoded, and
>   a UTF-8 signature added.

I did not add the IDLE changes to the PEP. Please upload them
as a feature request to SF.
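
For whoever files that request, the open/save rules quoted above could
be sketched as follows. All function names are hypothetical, this is not
IDLE code, and treating an unknown coding name as an encoding failure is
an assumption of the sketch.

```python
import codecs
import locale
import sys


def locale_coding():
    """Return the locale's coding as described in the quoted rules."""
    if sys.platform == "win32":
        return "mbcs"
    try:
        name = locale.nl_langinfo(locale.CODESET)
        codecs.lookup(name)  # make sure Python knows this codec
        return name
    except (AttributeError, LookupError):
        # Otherwise fall back to the system default encoding.
        return sys.getdefaultencoding()


def choose_open_coding(raw, declared=None):
    """Return (coding, had_signature) for a file being opened."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8", True
    if declared is not None:
        return declared, False
    return locale_coding(), False


def encode_for_save(text, had_signature, declared=None):
    """Encode `text` for saving, following the quoted rules."""
    if had_signature:
        return codecs.BOM_UTF8 + text.encode("utf-8")
    coding = declared if declared is not None else locale_coding()
    try:
        return text.encode(coding)
    except (UnicodeError, LookupError):
        # Encoding failed: save as UTF-8 and add the signature.
        return codecs.BOM_UTF8 + text.encode("utf-8")
```

The value passed as `declared` would come from whatever step parses the
coding comment; that step is outside this sketch.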

Thanks,
-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH