[Patches] [ python-Patches-534304 ] PEP 263 Implementation

noreply@sourceforge.net
Thu, 09 May 2002 06:42:54 -0700


Patches item #534304, was opened at 2002-03-24 14:52
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=305470&aid=534304&group_id=5470

Category: Parser/Compiler
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: SUZUKI Hisao (suzuki_hisao)
Assigned to: Nobody/Anonymous (nobody)
>Summary: PEP 263 Implementation

Initial Comment:
This is a sample implementation of PEP 263 phase 2.

This implementation behaves just as normal Python does
if no other coding hints are given, so it does not
affect anyone who uses Python now.  Note that it is
strictly compatible with the PEP: every program that is
valid under the PEP is also valid in this implementation.

This implementation also accepts files in UTF-16 with
BOM.  They are read as UTF-8 internally.  Please try
"utf16sample.py" included.


----------------------------------------------------------------------

>Comment By: Martin v. Löwis (loewis)
Date: 2002-05-09 15:42

Message:
Logged In: YES 
user_id=21627

I have now updated this patch to the current CVS, and to be
a complete PEP 263 implementation; it will issue warnings
when it finds non-ASCII characters but no encoding declaration.
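
The rule being implemented here — warn on non-ASCII source unless a coding declaration is present — can be sketched with the cookie regex that PEP 263 itself specifies. This is a simplified Python illustration of the rule, not the patch's C tokenizer code; `check_source` is an invented name:

```python
import re

# The coding-cookie pattern given in PEP 263.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")

def check_source(lines):
    """Return the declared coding, else warn if non-ASCII appears undeclared.

    Simplified sketch of the PEP 263 rule, not the actual implementation.
    """
    for line in lines[:2]:               # PEP 263: cookie must be on line 1 or 2
        m = CODING_RE.search(line)
        if m:
            return m.group(1)
    for line in lines:
        if any(ord(ch) > 127 for ch in line):
            print("warning: non-ASCII character but no encoding declared")
            break
    return None

print(check_source(["# -*- coding: utf-8 -*-", "s = '\u00e9'"]))
```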

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2002-04-26 21:41

Message:
Logged In: YES 
user_id=21627

I've updated the PEP to describe how this approach should be
used: Python 2.3 should still generate warnings only for
using non-ASCII without a declared encoding. I, too, hope
that Mr Suzuki will update the patch to match the PEP, and
against the current CVS tree.

As for supporting UTF-16: the stream reader currently has
the .readline method disabled, since it won't work reliably
for little-endian files. So I think this should be an
undocumented feature for the moment; I see no other technical
problems with the approach taken in the patch.
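
The little-endian problem with .readline can be seen directly: in UTF-16-LE a newline is the byte pair 0A 00, so a byte-oriented readline stops at the 0x0A and splits a code unit in half. A short demonstration (modern Python, for illustration):

```python
import io

data = "ab\ncd".encode("utf-16-le")   # b'a\x00b\x00\n\x00c\x00d\x00'
line = io.BytesIO(data).readline()    # byte-oriented: stops at the 0x0A byte
print(line)                           # b'a\x00b\x00\n'
# The trailing \x00 of the newline is left behind, so the "line" is no
# longer a whole number of UTF-16 code units:
print(len(line) % 2)                  # 1
```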

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2002-04-23 23:26

Message:
Logged In: YES 
user_id=6380

I haven't looked at this very carefully, but it looks like
it's well thought-out.

Suzuki, can you prepare a patch relative to current CVS?  I
get several patch failures now. (Fortunately I have a
checkout of 2.2 so I can still review and test the patch.)
I don't know what the patch failures are about (I haven't
investigated), but I imagine they might have to do with the
PEP 278 (universal newlines) changes checked in by Jack
Jansen, which replace the tokenizer's fgets() calls with
calls to Py_UniversalNewlineFgets().

Also, I can't read the README file (it's in Japanese :-).
What is the expected output from the samples? For me,
sjis_sample.py gives SyntaxError: 'unknown encoding'

Martin, I'm unclear of how you intend to use this code. Do
you intend to go straight to phase 2 of the PEP using this
patch? Or do you intend to implement phase 1 of the PEP by
modifying this code?

Also, does the PEP describe the UTF-16 support as
implemented by Suzuki's patch?


----------------------------------------------------------------------

Comment By: SUZUKI Hisao (suzuki_hisao)
Date: 2002-03-31 18:16

Message:
Logged In: YES 
user_id=495142

Thank you for your review.
Now 1. and 3. are fixed, and 2. is improved.
(4. is not true.)


----------------------------------------------------------------------

Comment By: Michael Hudson (mwh)
Date: 2002-03-30 12:27

Message:
Logged In: YES 
user_id=6656

Not going into 2.2.x.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2002-03-25 14:23

Message:
Logged In: YES 
user_id=21627

The patch looks good, but needs a number of improvements.

1. I have problems building this code. When trying to build
pgen, I get an error message of

Parser/parsetok.c: In function `parsetok':
Parser/parsetok.c:175: `encoding_decl' undeclared

The problem here is that graminit.h hasn't been built yet,
but parsetok refers to the symbol.

2. For some reason, error printing for incorrect encodings
does not work - it appears that it prints the wrong line in
the traceback.

3. The escape processing in Unicode literals is incorrect.
For example, u"\<non-ascii character>" should denote only
the non-ASCII character. However, your implementation
replaces the non-ASCII character with its \u<hex> escape,
resulting in \\u<hex>, so the first backslash unescapes the
second one.

4. I believe the escape processing in byte strings is also
incorrect for encodings that allow \ in the second byte.
Before processing escape characters, you convert back into
the source encoding. If this produces a backslash character,
escape processing will misinterpret that byte as an escape
character.
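
Both escape-processing problems above can be reproduced in a few lines (modern Python used purely for illustration):

```python
import codecs

# Item 3: rewriting the non-ASCII character in u"\<char>" as its \uXXXX
# spelling produces backslash + \uXXXX, and the first backslash then
# escapes the second, leaving the literal text "\u00e9" instead of
# a backslash followed by the character:
body = b"\\" + b"\\u00e9"                     # what the naive rewrite produces
print(codecs.decode(body, "unicode_escape"))  # literal \u00e9, not '\' + 'é'

# Item 4: in Shift-JIS (and other double-byte encodings) the second byte
# of a character may be 0x5C, i.e. the backslash.  The classic example is
# U+8868, whose Shift-JIS encoding ends in a backslash byte:
print("\u8868".encode("shift_jis"))           # b'\x95\\'
```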

----------------------------------------------------------------------
