[Python-ideas] Strings can sometimes convert to bytes without an encoding

Tue Jun 14 21:06:35 EDT 2016

On Tue, Jun 14, 2016 at 07:46:34PM -0400, Franklin? Lee wrote:
> On Tue, Jun 14, 2016 at 7:26 PM, Guido van Rossum <guido at python.org> wrote:
> > -1. Such a check for the contents of the string sounds exactly like the
> > Python 2 behavior we are trying to get away from.
> 
> But isn't it really just converting back and forth between two
> representations of the same thing? A str with char width 1 is
> conceptually an ASCII string; 

How do you reason that? There are many one-character strings that aren't 
ASCII, and encode to two or more bytes.

py> s = '\N{CJK UNIFIED IDEOGRAPH-4E4D}'
py> len(s)
1
py> s.encode('big5')
b'\xa5E'

Other multibyte encodings include UTF-16 and UTF-32. Even single-byte 
encodings are not necessarily based on ASCII, e.g. the various EBCDIC 
codecs.
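
To make that concrete, here is a quick sketch (the variable names are 
mine, purely for illustration) that encodes even the plain ASCII letter 
'A' under a handful of codecs shipped with CPython. The byte-order-explicit 
"-le" variants are used so that no BOM is prepended:

```python
# Encode the single ASCII letter 'A' under several standard codecs.
s = 'A'
codec_names = ('ascii', 'utf-8', 'utf-16-le', 'utf-32-le', 'cp500')
encoded = {name: s.encode(name) for name in codec_names}

for name, data in encoded.items():
    # cp500 is one of the EBCDIC codecs; its byte for 'A' is not 0x41.
    print(name, len(data), data)
```

Only the ASCII-compatible codecs agree with one another; the rest differ 
in length, in byte values, or both.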

The only string which is likely to return the same bytes under ALL 
encodings is the empty string. And even that is not guaranteed: imagine 
that somebody uses the codec machinery to create a string to bytes 
transformation which is not a pure character translation, e.g. a 
string-to-bytes version of:

py> import codecs
py> codecs.encode(b'', 'zip')
b'x\x9c\x03\x00\x00\x00\x00\x01'

Even ignoring such exotic transformations, it's not worth special casing 
the single case of '' --> b''. Who is going to write ''.encode() when 
they could just write b''?

> you're just changing how it's exposed to
> the program.
> 
> As it stands, when you have an ASCII string stored as a str, you can
> call str.encode() on it (whereby it will default to encoding='utf-8'),
> or you can call `bytes(s, 'utf-8')`, and pass in an argument which is
> conceptually ignored. (Unless it is in fact not an ASCII string!) 

Your comment in parentheses is the crux of the matter: your string may 
not be an ASCII string, so you don't know if the encoding is 
conceptually ignored.

Besides, it may not be ignored: the FSR (the Flexible String 
Representation of PEP 393) is a CPython implementation detail, not a 
language guarantee. 'A' is not necessarily stored internally as the 
single byte 0x41. Obvious alternatives are UTF-16 (two 
bytes) or UTF-32 (four bytes). I expect that Jython and IronPython will 
use whatever Java and .Net use for strings, which is unlikely to be the 
same as what CPython does.
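
If you want to see the FSR at work, the following is a rough sketch that 
assumes CPython 3.3 or later. It measures how much memory one extra 
character costs; sys.getsizeof is itself implementation-specific, and 
bytes_per_char is a helper name I made up for this example:

```python
import sys

def bytes_per_char(ch):
    # Hypothetical helper: extra memory used by one more copy of ch.
    return sys.getsizeof(ch * 101) - sys.getsizeof(ch * 100)

for ch in ('A', '\N{CJK UNIFIED IDEOGRAPH-4E4D}', '\N{GRINNING FACE}'):
    # On CPython the per-character cost depends on the widest code point
    # in the string; other implementations may behave differently.
    print(ascii(ch), bytes_per_char(ch))
```

The per-character cost varies with the string's contents on CPython, and 
nothing in the language promises that any of this holds elsewhere.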

> On
> the other hand, `bytes(s)` means, "Encoding shall not be necessary."
> That could be semantically useful, and a non-ASCII string will trigger
> an exception, while the other methods will just encode.

Encode how? Just by copying bytes from whatever internal representation 
the Python implementation happens to use? Anything else requires a codec.

You're suggesting raising an exception on non-ASCII strings, so either 
strings need an internal flag to say whether they're ASCII or not 
(possibly a good idea regardless), or bytes(s) has to scan the string, 
not just copy it.

To my mind, it sounds like giving bytes a default encoding of 'ascii' 
might satisfy you. Or you can write a wrapper:

import builtins

def bytes(s):
    return builtins.bytes(s, 'ascii')
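
Here is a sketch of how such a wrapper behaves. I've given it the 
hypothetical name ascii_bytes so that it doesn't shadow the builtin; 
otherwise it is the same idea:

```python
import builtins

def ascii_bytes(s):
    # Hypothetical name; encodes with the 'ascii' codec only.
    return builtins.bytes(s, 'ascii')

print(ascii_bytes('spam'))

try:
    ascii_bytes('sp\N{LATIN SMALL LETTER A WITH DIAERESIS}m')
except UnicodeEncodeError as err:
    # Non-ASCII input is rejected rather than silently encoded.
    print('rejected:', err.reason)
```

Pure ASCII text passes through, and anything else raises 
UnicodeEncodeError, which seems to be the behaviour you are asking for.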

-- 
Steve