[Python-3000] PEP 3120 (Was: PEP Parade)

Thu May 3 09:19:04 CEST 2007

>  S  3120  Using UTF-8 as the default source encoding   von Löwis
> 
> The basic idea seems very reasonable. I expect that the changes to the
> parser may be quite significant though. Also, the parser ought to be
> weaned off C stdio in favor of Python's own I/O library. I wonder if
> it's really possible to let the parser read the raw bytes though --
> this would seem to rule out supporting encodings like UTF-16. Somehow
> I wonder if it wouldn't be easier if the parser operated on Unicode
> input? That way parsing Unicode strings (which we must support as all
> strings will become Unicode) will be simpler.

Actually, changes should be fairly minimal. The parser already
transforms all input (no matter what source encoding) to UTF-8
before doing the parsing; this has worked well (as all keywords
continue to be one-byte characters). The parser also already
special-cases UTF-8 as the input encoding, by not putting it
through a codec. That can also stay, except that it should now
check that any non-ASCII bytes are well-formed UTF-8.
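
To illustrate, here is a rough Python sketch (the real tokenizer is C)
of the well-formedness check, and of why parsing on UTF-8 bytes is safe:
non-ASCII characters only ever produce bytes >= 0x80, so they cannot be
mistaken for keywords or other ASCII tokens. Both function names are
made up for this example.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


def non_ascii_bytes_are_high(text: str) -> bool:
    """Check that every non-ASCII character encodes only to bytes >= 0x80."""
    return all(
        byte >= 0x80
        for char in text
        if ord(char) > 0x7F
        for byte in char.encode("utf-8")
    )
```

A Latin-1 encoded "Löwis" fails the first check, because the lone byte
0xF6 is not a valid UTF-8 sequence.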

Untangling the parser from stdio - sure. I also think it would
be desirable to read the whole source into a buffer, rather than
reading the input line by line. That might be a bigger change,
making the tokenizer a multi-stage algorithm:
1. read input into a buffer
2. determine source encoding (looking at a BOM, else a
   declaration within the first two lines, else default
   to UTF-8)
3. if the source encoding is not UTF-8, pass it through
   a codec (decode to string, encode to UTF-8). Otherwise,
   check that all bytes are really well-formed UTF-8.
4. start parsing
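
Steps 1 to 3 could be sketched roughly as follows, in Python rather
than the tokenizer's C. The name source_to_utf8 is made up, the
regular expression is only modelled on the PEP 263 declaration
pattern, and a BOM that conflicts with a declaration is not handled.

```python
import codecs
import re

# Encoding declaration, modelled on the PEP 263 pattern (an assumption here).
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def source_to_utf8(raw: bytes) -> bytes:
    """Return the whole source buffer as well-formed UTF-8 (hypothetical)."""
    # Step 2: a UTF-8 BOM wins, else a declaration in the first two
    # lines, else default to UTF-8.
    encoding = "utf-8"
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    else:
        for line in raw.splitlines()[:2]:
            match = CODING_RE.match(line)
            if match:
                encoding = match.group(1).decode("ascii")
                break

    # Step 3: UTF-8 input is only validated; anything else goes
    # through a codec (decode to string, encode to UTF-8).
    if codecs.lookup(encoding).name == "utf-8":
        raw.decode("utf-8")  # raises UnicodeDecodeError if malformed
        return raw
    return raw.decode(encoding).encode("utf-8")
```

The result is what step 4 would then parse; malformed UTF-8 shows up
as a UnicodeDecodeError before parsing starts.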

As for UTF-16: the lexer currently does not support UTF-16
as a source encoding, as we require an ASCII superset.

I'm not sure whether UTF-16 needs to be supported as a
source encoding, but with the above changes, it would be fairly
easy to support, assuming we detect UTF-16 from the BOM
(can't use the encoding declaration, because that works
only for ASCII supersets).
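
A small Python sketch of that BOM check, purely as an illustration:
detect_utf16 is a made-up name, and the sketch ignores that a UTF-32
little-endian BOM begins with the same two bytes as the UTF-16 one.

```python
import codecs


def detect_utf16(raw: bytes):
    """Return "utf-16" if raw starts with a UTF-16 BOM, else None."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to choose the byte order.
        return "utf-16"
    return None


def utf16_source_to_utf8(raw: bytes) -> bytes:
    """Recode BOM-marked UTF-16 source to UTF-8 (hypothetical helper)."""
    encoding = detect_utf16(raw)
    if encoding is None:
        raise ValueError("no UTF-16 BOM found")
    return raw.decode(encoding).encode("utf-8")
```

The declaration cannot serve here because every ASCII character is
followed (or preceded) by a zero byte in UTF-16, so the byte string
b"coding" never appears in the raw input.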

Regards,
Martin