[Python-Dev] Support of UTF-16 and UTF-32 source encodings

Stephen J. Turnbull stephen at xemacs.org
Sun Nov 15 11:42:12 EST 2015


Random832 writes:
 > "Stephen J. Turnbull" <stephen at xemacs.org> writes:
 > > I don't see any good reason for allowing non-ASCII-compatible
 > > encodings in the reference CPython interpreter.
 > 
 > There might be a case for having the tokenizer not care about encodings
 > at all and just operate on a stream of unicode characters provided by a
 > different layer.

That's exactly what the PEP 263 implementation does in Python 2 (with
the caveat that the Python 2 tokenizer doesn't know anything about
Unicode: it sees a UTF-8 stream, and the non-ASCII characters are
treated as bytes of unknown semantics, so they can't be used in
syntax).  I don't know about Python 3; I haven't looked at how it
decodes source programs.  But I would assume it still implements PEP
263, except that since str is now either widechars or the PEP 393
representation (i.e., flexible widechars), that representation is now
used instead of UTF-8.
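
A minimal sketch of that layering, assuming Python 3 and the stdlib
tokenize module (which is not CPython's internal C tokenizer, so this
says nothing about what the interpreter itself does).  The helper name
tokenize_decoded is hypothetical; the caller supplies the encoding, and
the tokenizer only ever sees str:

```python
import io
import tokenize


def tokenize_decoded(source_bytes, encoding):
    """Decode bytes in one layer, then tokenize the resulting text."""
    # Layer 1 (hypothetical decoding layer): on-disk bytes -> str.
    text = source_bytes.decode(encoding)
    # Layer 2: generate_tokens() takes a readline that returns str,
    # so it never sees the on-disk representation.
    readline = io.StringIO(text).readline
    return [(tok.type, tok.string)
            for tok in tokenize.generate_tokens(readline)]


SOURCE = "naïve = 'café'\n"
utf8_tokens = tokenize_decoded(SOURCE.encode("utf-8"), "utf-8")
utf16_tokens = tokenize_decoded(SOURCE.encode("utf-16"), "utf-16")
```

With the decoding done up front, the two token lists compare equal even
though one input is ASCII-compatible on disk and the other is not.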

I'm sure that there are plenty of ASCII-isms in the tokenizer, in the
sense that it assumes the ASCII *character* (not byte) repertoire, but
I'm not sure why Serhiy thinks that the tokenizer cares about the
on-disk representation.  As I say, though, I haven't looked at the
code, so he might be right.

Steve


