[Python-ideas] bytes indexing behavior

Tue Jun 7 15:01:08 EDT 2016

Serhiy Storchaka writes:

 > I think representing bytes as an array of ints was a good decision. If you
 > need indexing to return a substring, you should use str instead. It is
 > also memory efficient thanks to PEP 393.

You can do this by using latin-1 as the codec, but that's pretty
unpleasant, because of the risk of combining with another str and
getting mojibake.
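
A minimal sketch of that latin-1 workaround and of the mojibake risk,
using only the standard codecs.  The sample text "café" and the
variable names are mine, chosen for illustration:

```python
# Bytes that happen to be UTF-8 encoded text.
raw = "café".encode("utf-8")          # b'caf\xc3\xa9'

# latin-1 maps every byte 0..255 to the code point with the same value,
# so decoding never fails and the round trip is lossless.
as_str = raw.decode("latin-1")
assert as_str.encode("latin-1") == raw

# Indexing now returns a one-character str rather than an int.
assert raw[3] == 0xC3
assert as_str[3] == "\xc3"

# The risk: nothing marks as_str as "really bytes", so it combines
# silently with real text and the result mixes two interpretations.
combined = as_str + " / " + "café"
print(combined)                       # cafÃ© / café
```

The last line is the mojibake: the first half is bytes viewed through
latin-1, the second half is text, and the str type cannot tell them
apart.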

I have long thought that it would be interesting to have a codec and
an extension to PEP 393 that gives "asciibytes" behavior.  That is,
the codec simply slops the bytes into the 8-bit storage of a string,
but when joined with another string the result types are:

asciibytes     other arg      result
has 8bit?      type           type
   yes         pure ascii     asciibytes
   yes         asciibytes     asciibytes
   yes         other str      str with 8bit bytes from asciibytes
                              encoded as PEP 383 surrogateescape
                              (note: promotes latin1 to 2-byte-wide)
   no          whatever       whatever
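
The table above can be sketched as a small decision function.  This
is only my reading of the table, not an existing API: Kind,
result_kind and join_with_str are hypothetical names, and the "no"
row is interpreted as "the result has the other argument's type".
The "other str" row is emulated with the real ascii codec and the
surrogateescape error handler:

```python
from enum import Enum


class Kind(Enum):
    # Hypothetical labels for the "other arg type" column.
    PURE_ASCII = "pure ascii"
    ASCIIBYTES = "asciibytes"
    OTHER_STR = "other str"


def result_kind(left_has_8bit: bool, other: Kind) -> Kind:
    """Result type of joining an asciibytes value with `other`."""
    if not left_has_8bit:
        # Last row: nothing to protect, the other operand decides.
        return other
    if other in (Kind.PURE_ASCII, Kind.ASCIIBYTES):
        return Kind.ASCIIBYTES
    # Third row: the result is an ordinary str.
    return Kind.OTHER_STR


def join_with_str(raw: bytes, other: str) -> str:
    """Emulate the "other str" row: bytes >= 0x80 become lone
    surrogates U+DC80..U+DCFF, as PEP 383 surrogateescape does."""
    return raw.decode("ascii", "surrogateescape") + other


joined = join_with_str(b"a\xff", "é")
print(ascii(joined))                  # 'a\udcff\xe9'
```

Because the escaped byte is now a code point above U+00FF, the result
no longer fits in 1-byte-per-character storage, which is the
promotion the table's note refers to.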

I think Nick actually had a module that worked pretty much like this,
but he never pushed it.  I've never had time to reason out the
possible failure modes, though, or the performance issues.  And it's
not an itch I personally need to scratch.

I believe (but haven't proved) that the failure modes with the above
operation table are the same as for str containing PEP 383
surrogates.  I'm not sure what other issues you might run into.  Also,
I'm not sure it's reasonable to have an asciibytes with no 8bit bytes.
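
For reference, here is a short sketch of the failure mode that str
containing PEP 383 surrogates already has today.  It uses only
standard codecs and says nothing about how a real asciibytes type
would behave; the sample bytes and variable names are mine:

```python
raw = b"abc\xff"                      # \xff is not valid UTF-8

# surrogateescape smuggles the undecodable byte through as U+DCFF.
s = raw.decode("utf-8", "surrogateescape")
assert s == "abc\udcff"

# Round-tripping works as long as the same error handler is used.
assert s.encode("utf-8", "surrogateescape") == raw

# Any strict encode, possibly far from where the data came in, fails.
try:
    s.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

print(strict_encode_failed)           # True
```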

Steve