PEP 263 comments

Bengt Richter bokr at oz.net
Mon Mar 4 16:50:37 EST 2002


On 01 Mar 2002 22:20:47 +0100, martin at v.loewis.de (Martin v. Loewis) wrote:

>Skip Montanaro <skip at pobox.com> writes:
>
>> Just to make sure I understand correctly, under Stephen's proposal would
>> 
>>     s = "\xff"
>> 
>> be correct?  I assume
>> 
>>     s = "ÿ"
>> 
>> (a literal 0377 character) would be an error, yes?  
>
>Yes, on both accounts.
>
>> That is, when you saw "arbitrary binary data" you are referring to
>> non-printable octets in the source file, right?
>
>Correct (except that whether something is printable is in the eye of
>the beholder). On the source level, the four letters '\', 'x', 'f',
>'f' are not arbitrary binary - they follow a specific syntax.
>
>I actually doubt anybody is putting "arbitrary binary" data into
>source code. Instead, most such occurrences are likely "printable", if
>viewed in the encoding of the author of that code. Those would be
>outlawed, unless that encoding is UTF-8.
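[Martin's distinction -- the four source characters '\', 'x', 'f', 'f' versus a literal 0xFF octet -- can be made concrete. A sketch in later-Python (3.x) terms, well outside this 2002 thread: the escape is pure ASCII and survives any source re-encoding, while the raw octet is encoding-dependent.]

```python
# The escape sequence \xff is four ASCII characters in the source text,
# so it means the same thing under any ASCII-compatible source encoding.
escape_source = rb'"\xff"'                      # the characters " \ x f f "
assert escape_source.decode('ascii') == '"\\xff"'
assert escape_source.decode('utf-8') == escape_source.decode('latin-1')

# A literal 0xFF octet in the source is another story: it is a printable
# character under Latin-1, but not even valid UTF-8.
literal_byte = b'\xff'
assert literal_byte.decode('latin-1') == '\xff'  # Latin-1: the letter 'ÿ'
try:
    literal_byte.decode('utf-8')
except UnicodeDecodeError:
    pass                                         # 0xFF never occurs in UTF-8
```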
>
However, people will want to put arbitrary data in strings, and
I think whether 'xxx' defines something other than 3 octets should be
controlled separately from whether the source is encoded in UTF-8 or
UTF-16 or whatever. Otherwise open('filexxx','wb').write('xxx') may
lose its meaning just because someone loaded the source into an editor
and wrote it back out as UTF-8 for whatever reason (e.g., intending
nothing more than adding some comments in Japanese).
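[The editor round-trip hazard can be shown directly. A hypothetical sketch in later-Python (3.x) terms, not the thread's Python 2.2: a source line containing a literal 0xFF octet, opened as Latin-1 and saved back as UTF-8, no longer contains that octet at all.]

```python
# A source line as its author saved it, with a raw 0xFF octet inside
# a string literal.
source_line = b"data = '\xff'\n"

# An editor opens the file interpreting it as Latin-1, then writes it
# back out re-encoded as UTF-8 -- e.g. after adding Japanese comments.
as_text = source_line.decode('latin-1')
resaved = as_text.encode('utf-8')

assert b'\xff' not in resaved        # the single octet is gone...
assert b'\xc3\xbf' in resaved        # ...replaced by the two-byte UTF-8 form
```

The visible program text never changed, but the bytes a byte-oriented reading of the literal would produce did.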

It may be convenient for some programs to have 'xxx' stored internally
as UTF-8 by default, but that would mean strings are strictly UTF-8
sequences, and passing 'xxx' to .write() would presumably mean passing
the UTF-8 bytes.

So if you wrote open('fileyyy','wb').write('\xff') you would expect
to find the UTF-8 sequence in the file, not a single chr(0xff) octet --
oops, what is chr() going to do? Shouldn't '#'+chr(0xff)+'#' be == '#\xff#'?
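[As it happens, later Python (3.x) answered exactly this question -- a sketch, anachronistic to the thread: chr(0xff) is the character U+00FF, the equality does hold, and binary .write() demands an explicit encode, which makes the one-octet-vs-two-octet choice visible rather than implicit.]

```python
# chr(0xff) produces the character U+00FF, and the concatenation
# identity from the question holds.
assert '#' + chr(0xff) + '#' == '#\xff#'

# The octet count only appears when you pick an encoding explicitly:
assert '\xff'.encode('utf-8') == b'\xc3\xbf'    # two octets as UTF-8
assert '\xff'.encode('latin-1') == b'\xff'      # one octet as Latin-1
```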

I think changing the internal representation of unqualified-by-u quoted
strings to UTF-8 would be a very radical step, even as a controllable
option. To make it just blindly match source encoding would be more than radical.

With UTF-8 as the default internal encoding for ordinary strings, I think
there would be a need for a plain old octet string as a distinct type
(a classical string?), perhaps corresponding to what we get now with
latin-1 encoding and rendering, for convenience. E.g., l'xxx', and maybe
L'xxx' for the raw variant?
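[The "plain old octet string" role eventually went to a separate bytes type in later Python (3.x) -- a sketch outside the thread's timeframe. Latin-1 supplies exactly the lossless octet-to-character correspondence alluded to above: every value 0..255 round-trips.]

```python
# Every possible octet value, as a byte string.
octets = bytes(range(256))

# Latin-1 maps octet value b to character U+00b and back, losslessly,
# so it can "render" arbitrary binary data as characters.
as_chars = octets.decode('latin-1')
assert as_chars.encode('latin-1') == octets
assert all(ord(ch) == b for b, ch in zip(octets, as_chars))
```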

Regards,
Bengt Richter



