Strings can sometimes convert to bytes without an encoding

Current behavior (3.5.1):

    >>> bytes('')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: string argument without an encoding

Suggestion: If the character size is 1, the `bytes`/`bytearray` constructor doesn't need a specified encoding.

High-level idea: If the string only has code points in range(128), encoding is optional (and useless anyway). The new error message could be

    TypeError: non-ASCII string argument without an encoding

How: CPython strings store characters in an array, such that each character takes a single entry. With an entry per character, indexing is just a regular C array index operation. Since PEP 393, the size of the elements of the array is just the size needed for the largest character. Thus, CPython strings "know" whether or not they're ASCII.

Other implementations without PEP 393 can do a scan of the code points to check the 0-127 condition during building. That means O(n) more checks, but in those implementations, the per-character checks are already necessary with an explicit encoding, since you'd need to see if that character needs encoding.

(The `bytes` and ASCII-`str` could in fact share memory, given a few tweaks. But that's an implementation detail.)
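
The proposed behavior can be approximated today as a plain function, which may make the discussion below easier to follow. This is only a sketch: the name `ascii_only_bytes` is hypothetical, and it checks the code points by attempting an ASCII encode rather than by inspecting any PEP 393 internals.

```python
def ascii_only_bytes(s):
    """Hypothetical helper: convert str to bytes only if every code point is < 128."""
    try:
        # The 'ascii' codec rejects any code point outside range(128).
        return s.encode('ascii')
    except UnicodeEncodeError:
        # Mirror the error message suggested in the proposal.
        raise TypeError('non-ASCII string argument without an encoding') from None


print(ascii_only_bytes('spam'))
```

Note that this is the same as choosing 'ascii' as the codec, which is the point several of the replies make.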

On Tue, Jun 14, 2016 at 7:26 PM, Guido van Rossum <guido@python.org> wrote:
-1. Such a check for the contents of the string sounds exactly like the Python 2 behavior we are trying to get away from.
But isn't it really just converting back and forth between two representations of the same thing? A str with char width 1 is conceptually an ASCII string; you're just changing how it's exposed to the program. As it stands, when you have an ASCII string stored as a str, you can call str.encode() on it (whereby it will default to encoding='utf-8'), or you can call `bytes(s, 'utf-8')`, and pass in an argument which is conceptually ignored. (Unless it is in fact not an ASCII string!) On the other hand, `bytes(s)` means, "Encoding shall not be necessary." That could be semantically useful, and a non-ASCII string will trigger an exception, while the other methods will just encode.

On 2016-06-14 19:46, Franklin? Lee wrote:
""" Python 3.5.1+ (default, Mar 30 2016, 22:46:26) [GCC 5.3.1 20160330] on linux Type "help", "copyright", "credits" or "license" for more information.
I think you might mean "code point that will be encoded to a single byte under UTF-8" (i.e. those in the ASCII range), because there are many code points that can be encoded to single bytes under various legacy encodings like ISO-8859-[1-15], the KOI variants, and so on.
What makes you say that the "utf-8" argument to `bytes` is conceptually ignored? That is the encoding used to convert the Unicode code points to bytes! That's by no means the only option either; Python supports multiple ASCII-incompatible encodings like UTF-16, UTF-32, and (as I recall) Shift-JIS, etc. 'utf-8' isn't hardcoded here, either; the encoding used is the one reported by `sys.getdefaultencoding()`. Saying "Unless it is in fact not an ASCII string!" doesn't make any sense either; UTF-8 is still used to convert e.g. U+0041 LATIN CAPITAL LETTER A to the byte 0x41. Even if your text string only contains code points in the range [U+0000, U+007F], this is a very important semantic distinction, so a big -1. MMR...
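
A short illustration of the point above, that the codec is doing real work even for a code point in the ASCII range. The variable names are chosen for this example only; the explicit `-le` and `-be` codec variants are used so that no byte order mark is added.

```python
import sys

text = 'A'  # U+0041 LATIN CAPITAL LETTER A, inside the ASCII range

utf8 = text.encode('utf-8')       # one byte per ASCII code point
utf16 = text.encode('utf-16-le')  # two bytes per code point here
utf32 = text.encode('utf-32-be')  # four bytes per code point

print(sys.getdefaultencoding())
print(utf8, utf16, utf32)
```

The same one-character string produces three different byte sequences, so the encoding argument is not ignored in any meaningful sense.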

On Tue, Jun 14, 2016 at 07:46:34PM -0400, Franklin? Lee wrote:
How do you reason that? There are many one-character strings that aren't ASCII, and encode to two or more bytes.

    py> s = '\N{CJK UNIFIED IDEOGRAPH-4E4D}'
    py> len(s)
    1
    py> s.encode('big5')
    b'\xa5E'

Other multibyte encodings include UTF-16 and UTF-32. Even single-byte encodings are not necessarily based on ASCII, e.g. the various EBCDIC codecs. The only string which is likely to return the same bytes under ALL encodings is the empty string. And even that is not guaranteed: imagine that somebody uses the codec machinery to create a string to bytes transformation which is not a pure character translation, e.g. a string-to-bytes version of:

    py> codecs.encode(b'', 'zip')
    b'x\x9c\x03\x00\x00\x00\x00\x01'

Even ignoring such exotic transformations, it's not worth special casing the single case of '' --> b''. Who is going to write ''.encode() when they could just write b''?
Your comment in parentheses is the crux of the matter: your string may not be an ASCII string, so you don't know if the encoding is conceptually ignored. Besides, it may not be ignored: the FSR is a CPython implementation detail, not a language guarantee. 'A' is not necessarily stored internally as the single byte 0x41. Obvious alternatives are UTF-16 (two bytes) or UTF-32 (four bytes). I expect that Jython and IronPython will use whatever Java and .Net use for strings, which is unlikely to be the same as what CPython does.
Encode how? Just by copying bytes from whatever internal representation the Python compiler happens to use? Anything else requires a codec. You're suggesting raising an exception on non-ASCII strings, so either strings need an internal flag to say whether they're ASCII or not (possibly a good idea regardless), or bytes(s) has to scan the string, not just copy it.

To my mind, it sounds like giving bytes a default encoding of 'ascii' might satisfy you. Or you can write a wrapper:

    import builtins

    def bytes(s):
        return builtins.bytes(s, 'ascii')

-- Steve

On Tue, Jun 14, 2016, at 21:06, Steven D'Aprano wrote:
I think he's using "char width" to mean "width of the character type in bytes", not "length of the string", and referring to implementation details of the FSR. But at that point, why say it's conceptually ASCII rather than conceptually Latin-1?
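
To make the "conceptually Latin-1" point concrete, here is a CPython-specific sketch. The helper `bytes_per_char` is hypothetical and relies on `sys.getsizeof`, whose results are an implementation detail; on CPython 3.3 and later it should report 1, 1, 2 and 4 for the four sample characters, but other implementations may differ.

```python
import sys


def bytes_per_char(ch):
    # CPython-specific: measure how much storage 1000 extra copies of ch add.
    return (sys.getsizeof(ch * 1001) - sys.getsizeof(ch)) // 1000


# An ASCII letter, a non-ASCII Latin-1 letter, a BMP letter, an astral emoji.
for ch in ('a', '\xe9', '\u0100', '\U0001F600'):
    print(repr(ch), bytes_per_char(ch))

# U+00E9 shares the narrowest storage width with ASCII, yet it is not ASCII
# and its encoded form depends on the codec.
latin1_char = '\xe9'
as_latin1 = latin1_char.encode('latin-1')
as_utf8 = latin1_char.encode('utf-8')
print(as_latin1, as_utf8)
```

So "character width 1" identifies strings that fit in Latin-1, not strings that are ASCII.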

On 06/14/2016 04:46 PM, Franklin? Lee wrote:
On Tue, Jun 14, 2016 at 7:26 PM, Guido van Rossum wrote:
The main reason Python 3 is not Python 2 is because text is text and bytes are bytes and there will be no more automagic encoding/decoding betwixt the two.

On 06/15/2016 01:55 AM, Franklin? Lee wrote:
- CPython is not the only Python
- Latin-1 is an implementation detail, not a language guarantee
- PyASCIIObject is (probably) a name left over from Python 2 (massive renames of various structures is usually needless code churn)
- it may not have been a bad idea when Python was created, but it is a bad idea now

Please put your energy elsewhere because this particular behavior is not going to change.

-- ~Ethan~

On 6/14/2016 6:58 PM, Franklin? Lee wrote:
A null case is the only case where an encoding is not needed, but having a special rule is not worth the trouble.
By character size, I presume you mean the PEP 393 internal 1,2,4 bytes per char. Non-ascii latin-1 chars can also be size 1. Not all encodings are ascii compatible. For example, IBM's EBCDIC. -- Terry Jan Reedy
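
As a small illustration of the EBCDIC remark: `cp500` is one of the EBCDIC codecs shipped with Python, and it maps ASCII-range letters to entirely different byte values. The sample text and variable names are arbitrary.

```python
text = 'GET'  # every code point is below 128

ascii_bytes = text.encode('ascii')
ebcdic_bytes = text.encode('cp500')  # an EBCDIC code page

print(ascii_bytes)
print(ebcdic_bytes)

# Decoding with the matching codec recovers the original text.
round_trip = ebcdic_bytes.decode('cp500')
print(round_trip)
```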

Franklin? Lee writes:
-1 The current rule is simple, the error obvious. The rule you propose is not only more complex, but also not TOOWTDI. Many use cases will want the default encoding, not ASCII. Also, YAGNI. People who really truly do want ASCII because of wire protocol elements that look like English words are likely to be working at the bytes level and not using str at all, especially after PEP 461.
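
For readers unfamiliar with PEP 461: it restored %-formatting for `bytes` in Python 3.5, which lets wire-protocol code stay at the bytes level throughout. A minimal sketch, with made-up header values:

```python
length = 42

# %d formats an integer as ASCII digits directly into a bytes object.
header = b'Content-Length: %d\r\n' % length

# %s accepts bytes-like operands; no str is involved at any point.
request_line = b'%s %s HTTP/1.1\r\n' % (b'GET', b'/index.html')

print(header)
print(request_line)
```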

Franklin? Lee wrote:
If the string only has code points in range(128), encoding is optional (and useless anyway).
No, it's not useless. It's possible to have an encoding that takes code points in the range 0-127 to something other than their ASCII equivalents. UTF-16, for example. You're effectively suggesting that ASCII or Latin-1 should be assumed as a default encoding, which seems like a bad idea. -- Greg

On Jun 15, 2016 1:52 AM, "Greg Ewing" <greg.ewing@canterbury.ac.nz> wrote:
UTF-8 is a default encoding for str.encode and bytes.decode. Latin-1 is the internal encoding in CPython whenever possible, and PyASCIIObject is an internal struct in Python 3. It is not exactly alien to Python to choose ASCII as a default. If it is a bad idea, it is not original to me. ASCII has a privileged position among single-byte encodings, even in Python 3. There's no 'builtins.latin1', let alone 'builtins.shiftjis' (though, someone might point out, it's not single-byte). We don't have 're.CA_1'. I could list more things that Python provides for ASCII but not any ASCII-incompatible encodings: https://docs.python.org/3/search.html?q=ascii
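
Two of the ASCII-specific facilities alluded to here are the builtin `ascii()` and the `re.ASCII` flag. A brief sketch of how they behave; the sample word is arbitrary:

```python
import re

word = 'caf\xe9'  # 'café', with U+00E9 outside the ASCII range

# By default \w matches any Unicode word character in a str pattern.
unicode_match = re.fullmatch(r'\w+', word)

# With re.ASCII, \w is restricted to [a-zA-Z0-9_], so the match fails.
ascii_match = re.fullmatch(r'\w+', word, re.ASCII)

# ascii() is like repr() but escapes every non-ASCII character.
escaped = ascii(word)

print(unicode_match is not None, ascii_match is not None, escaped)
```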

participants (11)
- Ethan Furman
- Franklin? Lee
- Greg Ewing
- Guido van Rossum
- Matt Ruffalo
- MRAB
- Random832
- Serhiy Storchaka
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Reedy