Python's 8-bit cleanness deprecated?

Tue Feb 4 12:46:40 EST 2003

On Tue, Feb 04, 2003 at 01:36:04PM +0100, Just wrote:
> In article <Z9M%9.174650$AA2.6989950 at news2.tin.it>,
>  Alex Martelli <aleax at aleax.it> wrote:
> 
> > > I think raw 8bit must be set by default without any warnings.
> > 
> > I disagree, but not hotly -- I'll be quite content with
> > whatever warning strategy ends up being adopted; say
> > I'm a +0 on the choice made for 2.3alpha.  But be warned
> > that you'll have to argue against hotly +1 people --
> > check the python-dev archives to hone your arguments.
> > (Arguing here is not much use of course, since Guido
> > doesn't read c.l.py currently).
> 
> Here's a possible compromise (which I'm not sure is implementable at 
> all): Python could only issue warnings if 8-bit chars are used in string 
> literals, and not if they only occur in comments.

What makes you believe that Python can tell what is a comment and what
is a string without knowing the encoding?

I think the only limitation of the source file encoding is that it must
be an ASCII superset.  So for instance I could have a perverse encoding
where 0x81 decodes to u'\n', and 0x83 is another valid character in the
encoding.  Then this byte string
    '#\x81"\x83"\x81'
actually decodes to
    u'#\n"\uXXXX"\n'
which means the file has a high-bit-set char inside a string literal,
even though the raw bytes contain no ASCII newline and so look like a
single comment line.
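
To make that concrete, here is a rough sketch in present-day Python 3
syntax.  The perverse_decode function and its table are made up for
illustration, and U+00E9 is only my stand-in for the unspecified
\uXXXX; no real codec is implied.

```python
def perverse_decode(data: bytes) -> str:
    """Decode with a hypothetical ASCII-superset encoding.

    Assumptions for illustration only: 0x81 decodes to a newline and
    0x83 decodes to U+00E9 (a placeholder for the unspecified \\uXXXX).
    """
    high_table = {0x81: "\n", 0x83: "\u00e9"}
    chars = []
    for byte in data:
        if byte < 0x80:
            chars.append(chr(byte))          # ASCII bytes map to themselves
        else:
            chars.append(high_table[byte])   # only two high bytes are defined here
    return "".join(chars)


raw = b'#\x81"\x83"\x81'

# Looking at the raw bytes there is no 0x0A, so this seems to be one comment line.
looks_like_one_comment = raw.startswith(b"#") and b"\n" not in raw

# After decoding, the "comment" ends at the first newline and a string
# literal with a non-ASCII character sits on the second line.
text = perverse_decode(raw)
lines = text.split("\n")
```

So a scanner that only looks at bytes would call the whole thing a
comment, while the decoded text has a string literal on its second line.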

If there is also a requirement that the encoding be capable of doing a
round-trip unchanged (e.g., s.decode("perverse").encode("perverse") == s
with s = "".join([chr(x) for x in range(256)])) then perhaps your idea
is a "safe" one.  In that case the encoding can't map two different
byte values onto \n, which is the key to my example.
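
A minimal sketch of such a round-trip check, again in present-day
Python 3 syntax.  The round_trips helper and the two-bytes-to-newline
decoder are hypothetical; latin-1 is used only because it is a real
codec that maps every byte to a distinct character.

```python
def round_trips(decode, encode) -> bool:
    """Return True if all 256 byte values survive decode followed by encode."""
    s = bytes(range(256))
    try:
        return encode(decode(s)) == s
    except (UnicodeError, KeyError):
        # Undefined byte values also count as a failed round trip.
        return False


def latin1_decode(data: bytes) -> str:
    return data.decode("latin-1")


def latin1_encode(text: str) -> bytes:
    return text.encode("latin-1")


def two_newlines_decode(data: bytes) -> str:
    # Hypothetical perverse decoder: like latin-1, except that 0x81 also
    # decodes to "\n", so both 0x0A and 0x81 become a newline.
    return data.decode("latin-1").replace("\x81", "\n")


latin1_ok = round_trips(latin1_decode, latin1_encode)
perverse_ok = round_trips(two_newlines_decode, latin1_encode)
```

Here latin1_ok comes out True and perverse_ok comes out False.  No
encoder can rescue the second case, since once two bytes decode to the
same character there is no way to tell which byte to write back.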

Jeff