PEP 263 spec (was: Proposal: require 7-bit source str's)

Sat Aug 7 15:32:24 EDT 2004

Martin v. Löwis wrote:
>Hallvard B Furuseth wrote:
>
>> - For a number of source encodings (like utf-8:-) it should be easy
>>   to parse and charset-convert in the same step, and only convert
>>   selected parts of the source to Unicode.
>
> Correct. However, that it works "for a number of source encodings"
> is insufficient - if it doesn't work for all of them, it only 
> unreasonably complicates the code.

For UTF-8 source, the complication might simply be to not call a charset
conversion routine.  For some other character sets - well, fixing the
problem below would probably introduce that complication anyway.

> For some source encodings (namely the CJK ones), conversion to UTF-8
> is absolutely necessary even for proper lexical analysis, as the
> byte that represents a backslash in ASCII might be the first byte
> of a two-byte sequence.

No.  It's necessary to convert the source file to logical characters
and feed those to the parser in some way, and conversion to UTF-8 is
a simple way to do that.
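
To make the byte-versus-character distinction concrete, here is a
minimal sketch.  It assumes Shift_JIS as the example encoding (the
thread does not name one) and uses Python's standard codecs.  In
Shift_JIS the 0x5C byte shows up as the second byte of the pair, so a
lexer that scans raw bytes sees a backslash that is not there at the
character level.

```python
# Assumption: Shift_JIS stands in for "the CJK encodings"; the
# discussion itself does not name a specific one.
text = "表"                       # U+8868
sjis = text.encode("shift_jis")   # two bytes: 0x95 0x5C
utf8 = text.encode("utf-8")       # three bytes, all >= 0x80

# A byte-oriented lexer finds an ASCII backslash inside the Shift_JIS form...
backslash_in_sjis_bytes = b"\\" in sjis

# ...but there is no backslash among the decoded (logical) characters.
backslash_in_characters = "\\" in sjis.decode("shift_jis")

# In UTF-8, no byte of a multi-byte sequence falls in the ASCII range,
# so scanning UTF-8 bytes for ASCII delimiters is safe.
ascii_bytes_in_utf8 = [b for b in utf8 if b < 0x80]
```

Any representation in which the lexer sees whole characters would do;
UTF-8 is just a convenient one because ASCII bytes never occur inside
its multi-byte sequences.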

I think the 'right way', as far as source character set handling is
concerned, would be to have the source reader and the language parser
cooperate:  The reader translates the source file to logical source
characters, which it feeds to the parser (UTF-8 is fine for that).  The
parser notifies the reader when it sees the start and end of a source
character string, and the reader then gives that string to the parser
in its original form (by some other means than feeding it to the parser
as if it were charset-converted source code, of course).
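
The cooperation described above could be sketched roughly as follows.
This is only an illustration, not how Python's tokenizer works: the
names read_chars and string_literals are hypothetical, only
double-quoted literals are recognized, and it assumes the incremental
decoder emits each character as soon as its last byte arrives (stateful
encodings with escape sequences would need more care).

```python
import codecs

def read_chars(raw, encoding):
    """Reader: yield (character, original_bytes) pairs from raw source."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pending = bytearray()
    for byte in raw:
        pending.append(byte)
        text = decoder.decode(bytes([byte]))
        if text:                      # a complete character was decoded
            yield text, bytes(pending)
            pending.clear()
    decoder.decode(b"", final=True)   # raises if the input ends mid-character

def string_literals(raw, encoding):
    """Parser: lex on characters, but collect literals as original bytes."""
    literals = []
    current = None                    # None means "outside a literal"
    escaped = False
    for char, original in read_chars(raw, encoding):
        if current is None:
            if char == '"':
                current = bytearray() # start of literal: begin collecting
        elif escaped:
            current += original
            escaped = False
        elif char == "\\":
            current += original
            escaped = True
        elif char == '"':
            literals.append(bytes(current))
            current = None            # end of literal
        else:
            current += original
    return literals

source = 's = "表"\n'.encode("shift_jis")
found = string_literals(source, "shift_jis")
```

Here the lexical decisions (where the quotes are, what is a backslash)
are made on decoded characters, while the literal's contents come back
as the untouched source bytes, with no conversion to Unicode and back.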

Now, that might conflict with Python's design goals, if it is supposed
to be possible to keep the reading and parsing steps separate.  Or it
might just take more effort to rearrange the code than anyone is
interested in doing.  But in either case it still looks like a bug to
me, even if it's at best a low-priority one.

>> - I think the spec is buggy anyway.  Converting to Unicode and back
>>   can change the string representation.  But I'll file a separate
>>   bug report for that.
> 
> That is by design. The only effect of such a bug report will be that
> the documentation clearly clarifies that.

OK, I'll make it a doc bug.
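
As an illustration of the round-trip point, here is a minimal sketch.
It assumes ISO-2022-JP as the example encoding (the thread does not name
one) and uses Python's standard codec for it.  A redundant escape
sequence in the source bytes is lost when the text passes through
Unicode.

```python
# Assumption: ISO-2022-JP is only an example; the discussion names no
# specific encoding for which the round trip is lossy.
original = b"\x1b(Babc"          # redundant "switch to ASCII" escape, then "abc"

as_unicode = original.decode("iso2022_jp")
round_tripped = as_unicode.encode("iso2022_jp")

# The characters are the same, but the byte string is not.
changed = round_tripped != original
```

The decoded text is identical either way; only the byte representation
of the string differs after the round trip.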

-- 
Hallvard