[Python-Dev] Support of UTF-16 and UTF-32 source encodings
Stephen J. Turnbull
stephen at xemacs.org
Sun Nov 15 11:42:12 EST 2015
Random832 writes:
> "Stephen J. Turnbull" <stephen at xemacs.org> writes:
> > I don't see any good reason for allowing non-ASCII-compatible
> > encodings in the reference CPython interpreter.
>
> There might be a case for having the tokenizer not care about encodings
> at all and just operate on a stream of unicode characters provided by a
> different layer.
That's exactly what the PEP 263 implementation does in Python 2, with
the caveat that Python 2 doesn't know anything about Unicode: the
tokenizer sees a UTF-8 stream, and the non-ASCII characters are
treated as bytes of unknown semantics, so they can't be used in
syntax. I don't know about Python 3; I haven't looked at the decoding
of source programs. But I would assume it still implements PEP 263,
except that since str is now either widechars or the PEP 393 encoding
(i.e., flexible widechars), that encoding is now used instead of UTF-8.
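
To make the layering concrete, here is a rough sketch using the
pure-Python tokenize module rather than the C tokenizer, so it
illustrates the idea and is not a description of what CPython does
internally. The helper tokenize_bytes and the sample SOURCE are made
up for illustration. A separate layer decodes the bytes, and the
tokenizer only ever sees text:

```python
import io
import tokenize

# Hypothetical one-line program containing a non-ASCII character.
SOURCE = "name = 'caf\u00e9'\n"


def tokenize_bytes(raw, encoding):
    """Decode raw bytes in one layer, then tokenize the resulting text."""
    # Layer 1: the only place that knows about the on-disk encoding.
    text = raw.decode(encoding)
    # Layer 2: the tokenizer reads str lines and never sees bytes.
    readline = io.StringIO(text).readline
    return [(tok.type, tok.string)
            for tok in tokenize.generate_tokens(readline)]


# Encode the same program three ways, then decode and tokenize each.
results = {
    enc: tokenize_bytes(SOURCE.encode(enc), enc)
    for enc in ("utf-8", "utf-16", "utf-32")
}
```

If the decoding layer does its job, the three token lists should be
identical, since the tokenizer has no way to tell which encoding the
text came from.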
I'm sure that there are plenty of ASCII-isms in the tokenizer, in the
sense that it assumes the ASCII *character* (not byte) repertoire.
What I'm not sure of is why Serhiy thinks that the tokenizer cares
about the on-disk representation. But as I say, I haven't looked at
the code, so he might be right.
Steve