[Python-ideas] Strings can sometimes convert to bytes without an encoding

Tue Jun 14 20:30:37 EDT 2016

On 2016-06-14 19:46, Franklin? Lee wrote:
> But isn't it really just converting back and forth between two
> representations of the same thing? A str with char width 1 is
> conceptually an ASCII string; you're just changing how it's exposed to
> the program.
Your concept of a "str with char width 1" is not well-defined at all,
and the claim that such a str is conceptually an ASCII string is not
true under most ways I can think of to interpret what you've said.

"""
Python 3.5.1+ (default, Mar 30 2016, 22:46:26)
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> s = '\U0001f4a9'
>>> len(s)
1
>>> len(s.encode('utf-8'))
4
"""

I think you might mean "code point that will be encoded to a single byte
under UTF-8" (i.e. those in the ASCII range). That qualification matters,
because many non-ASCII code points are also encoded to single bytes
under various legacy encodings like the ISO-8859 family, the KOI8
variants, and so on.
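
To make the "single byte" point concrete, here is a rough sketch. The
sample characters, the codec choices, and the helper name
`encoded_length` are picked purely for illustration and are not taken
from the thread.

```python
# Illustrative helper (hypothetical name): how many bytes does `text`
# occupy under `encoding`? Returns None if the codec cannot represent it.
def encoded_length(text, encoding):
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None

# Sample code points chosen for illustration only.
samples = [
    'A',        # U+0041, in the ASCII range
    '\u00e9',   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    '\u0416',   # U+0416 CYRILLIC CAPITAL LETTER ZHE
]

for ch in samples:
    for codec in ('utf-8', 'iso-8859-1', 'koi8-r'):
        print(ascii(ch), codec, encoded_length(ch, codec))
```

Only the ASCII-range character is one byte under all three codecs; the
other two are one byte under one legacy codec, two bytes under UTF-8,
and not representable at all under the other legacy codec.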

> As it stands, when you have an ASCII string stored as a str, you can
> call str.encode() on it (whereby it will default to encoding='utf-8'),
> or you can call `bytes(s, 'utf-8')`, and pass in an argument which is
> conceptually ignored. (Unless it is in fact not an ASCII string!) On
> the other hand, `bytes(s)` means, "Encoding shall not be necessary."
> That could be semantically useful, and a non-ASCII string will trigger
> an exception, while the other methods will just encode.

What makes you say that the "utf-8" argument to `bytes` is conceptually
ignored? That is the encoding used to convert the Unicode code points to
bytes! That's by no means the only option either; Python supports
multiple ASCII-incompatible encodings like UTF-16, UTF-32, and (as I
recall) Shift-JIS, etc. The 'utf-8' in `str.encode()` is only a default
argument value, too (the same value that `sys.getdefaultencoding()`
reports on Python 3); an encoding is still being applied.
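
As a quick check that the encoding argument is not ignored even for
ASCII-only text, consider the sketch below. The sample string and the
helper names `encode_all` and `bytes_without_encoding` are made up for
illustration.

```python
# Encode the same text under several codecs and collect the results.
def encode_all(text, codecs=('utf-8', 'utf-16-le', 'utf-32-le')):
    return {name: text.encode(name) for name in codecs}

# Show what bytes(s) does today when no encoding is supplied.
def bytes_without_encoding(text):
    try:
        return bytes(text)
    except TypeError as exc:
        return exc

text = 'abc'  # every code point is in the ASCII range

for name, data in encode_all(text).items():
    print(name, data)

# With no argument, str.encode() falls back to UTF-8.
print(text.encode() == text.encode('utf-8'))

# bytes(str) with no encoding is rejected rather than guessed at.
print(repr(bytes_without_encoding(text)))
```

The three codecs give three different byte strings for the same three
code points, so the choice of encoding is doing real work.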

Saying "Unless it is in fact not an ASCII string!" doesn't make any
sense either; UTF-8 is still used to convert e.g. U+0041 LATIN CAPITAL
LETTER A to the byte 0x41. Even if your text string only contains code
points in the range [U+0000, U+007F], this is a very important semantic
distinction, so a big -1.
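
To illustrate that U+0041 becoming 0x41 is a property of the codec and
not of the string, here is a small sketch. The use of cp037 (an EBCDIC
code page shipped with Python) as the counterexample, and the helper
name `encode_capital_a`, are my own choices for illustration.

```python
# Encode U+0041 LATIN CAPITAL LETTER A under a caller-supplied codec.
def encode_capital_a(encoding):
    return '\u0041'.encode(encoding)

print(encode_capital_a('utf-8'))      # the familiar single byte 0x41
print(encode_capital_a('utf-16-be'))  # two bytes: 0x00 0x41
print(encode_capital_a('cp037'))      # EBCDIC: a single byte, but 0xC1

# Going the other way, the byte 0x41 is only 'A' if the decoder agrees.
print(b'\x41'.decode('utf-8') == 'A')
print(b'\x41'.decode('cp037') == 'A')
```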

MMR...