[Python-checkins] CVS: python/nondist/peps pep-0263.txt,1.6,1.7

M.-A. Lemburg lemburg@users.sourceforge.net
Wed, 27 Feb 2002 03:07:18 -0800


Update of /cvsroot/python/python/nondist/peps
In directory usw-pr-cvs1:/tmp/cvs-serv5930

Modified Files:
	pep-0263.txt 
Log Message:
Changes regarding the default encoding and other minor tweaks.
See history for details.



Index: pep-0263.txt
===================================================================
RCS file: /cvsroot/python/python/nondist/peps/pep-0263.txt,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** pep-0263.txt	26 Feb 2002 20:26:07 -0000	1.6
--- pep-0263.txt	27 Feb 2002 11:07:16 -0000	1.7
***************
*** 40,45 ****
  Defining the Encoding
  
!     Python will default to Latin-1 as standard encoding if no other
!     encoding hints are given.
  
      To define a source code encoding, a magic comment must
--- 40,47 ----
  Defining the Encoding
  
!     Just as in coercion of strings to Unicode, Python will default to
!     the interpreter's default encoding (which is ASCII in standard
!     Python installations) as standard encoding if no other encoding
!     hints are given.
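
    For reference, the interpreter's default encoding mentioned above
    is the value reported by sys.getdefaultencoding(). A minimal check,
    assuming a stock installation:

        import sys

        # "ascii" on a standard Python 2 installation of this PEP's
        # era; current Python 3 interpreters report "utf-8".
        print(sys.getdefaultencoding())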
  
      To define a source code encoding, a magic comment must
***************
*** 50,53 ****
--- 52,60 ----
            # -*- coding: <encoding name> -*-
  
+     More precisely, the first or second line must match the regular
+     expression "coding[:=]\s*([\w\-_]+)". The first group of this
+     expression is then interpreted as the encoding name. If the
+     encoding is unknown to Python, an error is raised during
+     compilation.
+ 
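    As an illustration of the rule above, a minimal sketch of the
    detection, assuming present-day Python and an invented helper name
    (the real check is performed by the tokenizer, not by user code);
    the UTF-8 signature mentioned below is taken to imply UTF-8:

        import codecs
        import re

        CODING_RE = re.compile(r"coding[:=]\s*([\w\-_]+)")

        def detect_source_encoding(source, default="ascii"):
            # A UTF-8 signature (BOM), as written by some Windows
            # editors, is assumed to imply UTF-8.
            if source.startswith(codecs.BOM_UTF8):
                return "utf-8"
            # Only the first two lines are searched for the comment.
            for line in source.splitlines()[:2]:
                match = CODING_RE.search(line.decode("latin-1"))
                if match:
                    return match.group(1)
            return default

        # e.g. detect_source_encoding(b"# -*- coding: utf-8 -*-\n")
        # returns "utf-8".
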
      To aid with platforms such as Windows, which add Unicode BOM marks
      to the beginning of Unicode files, the UTF-8 signature
***************
*** 67,71 ****
         Embedding of differently encoded data is not allowed and will
         result in a decoding error during compilation of the Python
!        source code. 
  
         Only ASCII compatible encodings are allowed as source code
--- 74,78 ----
         Embedding of differently encoded data is not allowed and will
         result in a decoding error during compilation of the Python
!        source code.
  
         Only ASCII compatible encodings are allowed as source code
***************
*** 102,115 ****
         subset of the encoding.
  
-     For backwards compatibility, the implementation must assume
-     Latin-1 as the original file encoding if not given (otherwise,
-     binary data currently stored in 8-bit strings wouldn't make the
-     roundtrip).
- 
  Implementation
  
      Since changing the Python tokenizer/parser combination will
!     require major changes in the internals of the interpreter, the
!     proposed solution should be implemented in two phases:
  
      1. Implement the magic comment detection and default encoding
--- 109,120 ----
         subset of the encoding.
  
  Implementation
  
      Since changing the Python tokenizer/parser combination will
!     require major changes in the internals of the interpreter and
!     enforce the use of magic comments in source code files which
!     place non-default encoding characters in string literals, comments
!     and Unicode literals, the proposed solution should be implemented
!     in two phases:
  
      1. Implement the magic comment detection and default encoding
***************
*** 117,133 ****
         literals in the source file.
  
      2. Change the tokenizer/compiler base string type from char* to
         Py_UNICODE* and apply the encoding to the complete file.
  
  Scope
  
!     This PEP only affects Python source code which makes use of the
!     proposed magic comment. Without the magic comment in the proposed
!     position, Python will treat the source file as it does currently
!     (using the Latin-1 encoding assumption) to maintain backwards
!     compatibility.
  
  History
  
      1.3: Worked in comments by Martin v. Loewis: 
           UTF-8 BOM mark detection, Emacs style magic comment,
--- 122,153 ----
         literals in the source file.
  
+        In addition to this step and to aid in the transition to
+        explicit encoding declaration, the tokenizer must check the
+        complete source file for compliance with the default encoding
+        (which usually is ASCII). If the source file does not properly
+        decode, a single warning is generated per file.
+ 
      2. Change the tokenizer/compiler base string type from char* to
         Py_UNICODE* and apply the encoding to the complete file.
  
+        Source files which fail to decode cause an error to be raised
+        during compilation.
+ 
+        The builtin compile() API will be enhanced to accept Unicode as
+        input. 8-bit string input is subject to the standard procedure
+        for encoding detection as described above.
+ 
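    To make the two phases above concrete, a small sketch; the function
    name and warning text are invented for illustration, and the real
    compliance check would live in the tokenizer:

        import warnings

        def check_default_encoding(path, default="ascii"):
            # Phase 1: emit a single warning per file whose contents do
            # not decode with the (usually ASCII) default encoding.
            with open(path, "rb") as f:
                data = f.read()
            try:
                data.decode(default)
            except UnicodeDecodeError as exc:
                warnings.warn("%s does not comply with the default "
                              "encoding %r: %s" % (path, default, exc))

        # Phase 2: compile() accepts Unicode input directly, while
        # 8-bit string input is run through the encoding detection,
        # honouring a magic comment such as the one shown earlier.
        compile(u"x = u'abc'", "<string>", "exec")
        compile(b"# -*- coding: utf-8 -*-\nx = u'abc'\n",
                "<string>", "exec")
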
  Scope
  
!     This PEP intends to provide an upgrade path from the current
!     (more-or-less) undefined source code encoding situation to a more
!     robust and portable definition.
  
  History
  
+     1.7: Added warnings to phase 1 implementation. Replaced the
+          Latin-1 default encoding with the interpreter's default
+          encoding. Added tweaks to compile().
+     1.4 - 1.6: Minor tweaks
      1.3: Worked in comments by Martin v. Loewis: 
           UTF-8 BOM mark detection, Emacs style magic comment,
***************
*** 138,146 ****
      This document has been placed in the public domain.
  
- 
  
  Local Variables:
  mode: indented-text
  indent-tabs-mode: nil
- fill-column: 70
  End:
--- 158,164 ----