Proposal: require 7-bit source str's
"Martin v. Löwis"
martin at v.loewis.de
Sat Aug 7 00:16:41 CEST 2004
Hallvard B Furuseth wrote:
> - For a number of source encodings (like utf-8:-) it should be easy
> to parse and charset-convert in the same step, and only convert
> selected parts of the source to Unicode.
Correct. However, that it works "for a number of source encodings"
is insufficient - if it doesn't work for all of them, it doesn't
remove the need for the general path; it only complicates the code.
For some source encodings (namely the CJK ones), conversion to UTF-8
is absolutely necessary even for proper lexical analysis, as the
byte that represents a backslash in ASCII might be part of a
two-byte sequence encoding a completely different character.
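To illustrate the point (this example is mine, not from the original
post): in Shift_JIS, the katakana character 'ソ' encodes to the bytes
0x83 0x5C, so a byte-oriented lexer scanning for backslashes (0x5C)
would find one in the middle of that character.

```python
# In Shift_JIS, 0x5C (the ASCII byte for '\') can occur as the
# second byte of a two-byte character.
text = "ソ"                      # KATAKANA LETTER SO
raw = text.encode("shift_jis")
print(raw == b"\x83\x5c")        # True: second byte is the backslash byte
print(b"\\" in raw)              # True: a naive byte scan sees a "backslash"
print("\\" in text)              # False: the decoded text contains none
```

This is why the tokenizer cannot simply scan the raw bytes for
escape characters in such encodings.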
> - I think the spec is buggy anyway. Converting to Unicode and back
> can change the string representation. But I'll file a separate
> bug report for that.
That is by design. The only effect of such a bug report will be that
the documentation clarifies this explicitly. Users who need to make
sure the run-time representation of a string is the same as the
source representation need to pick a source encoding that round-trips.
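A quick way to check whether an encoding round-trips a given byte
string is to decode and re-encode it and compare (a sketch; the
helper name `round_trips` is mine):

```python
def round_trips(data: bytes, encoding: str) -> bool:
    """Return True if decoding then re-encoding reproduces the bytes."""
    try:
        return data.decode(encoding).encode(encoding) == data
    except UnicodeError:
        # Bytes that the codec cannot decode certainly don't round-trip.
        return False

print(round_trips(b"abc", "utf-8"))          # True
print(round_trips(b"\x83\x5c", "shift_jis")) # True
print(round_trips(b"\xff", "utf-8"))         # False: invalid UTF-8 byte
```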
> Sorry, I thought you were speaking of promising a __future__ when all
> string literals are required to be 7-bit or u'' literals.
Yes, but that *will* cause a wide debate. Say, Python 3.5, to be
released in 2017 or so. I could live with such a language, but I'm
certain many users couldn't, in any foreseeable future.