[Python-ideas] Strings can sometimes convert to bytes without an encoding

Franklin? Lee leewangzhong+python at gmail.com
Tue Jun 14 18:58:06 EDT 2016


Current behavior (3.5.1):
    >>> bytes('')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: string argument without an encoding

Suggestion:
If the character size is 1, the `bytes`/`bytearray` constructor
doesn't need a specified encoding.

High-level idea:
If the string only has code points in range(128), encoding is optional
(and useless anyway). The new error message could be
    TypeError: non-ASCII string argument without an encoding
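
A rough sketch of the suggested behavior, written as an ordinary function
rather than a change to the `bytes` constructor itself. The name
`bytes_from_str` is hypothetical and exists only for illustration; it is
not part of Python.

```python
def bytes_from_str(s, encoding=None):
    """Hypothetical helper emulating the suggested bytes(str) behavior."""
    if encoding is not None:
        # An explicit encoding works exactly as it does today.
        return s.encode(encoding)
    try:
        # ASCII-only text maps to the same bytes in any ASCII-compatible encoding.
        return s.encode("ascii")
    except UnicodeEncodeError:
        # Proposed replacement for "string argument without an encoding".
        raise TypeError("non-ASCII string argument without an encoding") from None


print(bytes_from_str("spam"))               # b'spam'
print(bytes_from_str("caf\u00e9", "utf-8"))  # b'caf\xc3\xa9'
```

Calling `bytes_from_str("caf\u00e9")` without an encoding raises the
proposed TypeError.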

How:
CPython strings store characters in an array, such that each character
takes a single entry. With an entry per character, indexing is just a
regular C array index operation. Since PEP 393, the size of the
elements of the array is just the size needed for the largest
character, and the string also records a flag saying whether it is
pure ASCII. Thus, CPython strings "know" whether or not they're ASCII.
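
The per-character element size can be observed indirectly from Python.
This is only a sketch and assumes CPython with PEP 393 (3.3 or later);
`bytes_per_char` is a hypothetical helper, and `sys.getsizeof` reports
implementation-specific numbers. Note that 1-byte storage also covers
Latin-1 code points up to U+00FF, so the element size alone does not
tell ASCII apart from Latin-1.

```python
import sys


def bytes_per_char(ch, n=100):
    """Estimate storage bytes per character for a string of repeated `ch`.

    Subtracting the sizes of two strings of different lengths cancels
    the fixed object header, leaving only the per-character cost.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


for ch in ("a", "\u00e9", "\u0100", "\U0001F600"):
    print(hex(ord(ch)), bytes_per_char(ch))
```

On such a CPython build the widths should come out as 1, 1, 2 and 4.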

Other implementations without PEP 393 can do a scan of the code points
to check the 0-127 condition during building. That means O(n) more
checks, but in those implementations, the per-character checks are
already necessary with an explicit encoding, since you'd need to see
if that character needs encoding.
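
A sketch of that scan, with the 0-127 check folded into the same loop
that builds the result, so there is no separate pass over the string.
The function name is hypothetical, and a real implementation would do
this in its native language rather than in Python.

```python
def ascii_bytes_single_pass(s):
    """Build bytes from `s`, rejecting any code point outside range(128)."""
    out = bytearray(len(s))
    for i, ch in enumerate(s):
        cp = ord(ch)
        if cp >= 128:
            # Same check an encoder performs to decide if a character
            # needs more than a plain copy.
            raise TypeError("non-ASCII string argument without an encoding")
        out[i] = cp
    return bytes(out)
```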

(The `bytes` and ASCII-`str` could in fact share memory, given a few
tweaks. But that's an implementation detail.)

