[Python-ideas] bytes indexing behavior
Stephen J. Turnbull
stephen at xemacs.org
Tue Jun 7 15:01:08 EDT 2016
Serhiy Storchaka writes:
> I think representing bytes as an array of ints was a good decision. If you
> need indexing to return a substring, you should use str instead. It is
> as well memory efficient thanks to PEP 393.
You can do this by using latin-1 as the codec, but that's pretty
unpleasant, because of the risk of combining with another str and
getting mojibake.
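To make the latin-1 trick and its hazard concrete, here is a minimal sketch. The sample data and variable names are illustrative assumptions, not anything from the thread.

```python
# Illustrative sample: the UTF-8 encoding of "café" (an assumption for the demo).
data = b"caf\xc3\xa9"

# Indexing bytes yields an int; slicing yields bytes.
first_item = data[0]        # an int
first_slice = data[0:1]     # a length-1 bytes object

# latin-1 maps each byte 0..255 to the code point with the same number,
# so decoding never fails and the round trip is lossless.
as_text = data.decode("latin-1")
first_char = as_text[0]     # now a length-1 str
round_trip = as_text.encode("latin-1")

# The hazard: nothing marks as_text as "really bytes", so it can be
# concatenated with genuine text, and the two UTF-8 bytes of the accented
# letter show up as two unrelated latin-1 characters (mojibake).
mixed = "naïve " + as_text
```

The str object carries no record that it was produced by a byte-preserving decode, which is why the combination goes unnoticed until the text is displayed or re-encoded.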
I have long thought that it would be interesting to have a codec and
an extension to PEP 393 that gives "asciibytes" behavior. That is,
the codec simply slops the bytes into the 8-bit storage of a string,
but when joined with another string the result types are:
    asciibytes    other arg     result
    has 8bit      type          type
    ----------    ----------    --------------------------------------
    yes           pure ascii    asciibytes
    yes           asciibytes    asciibytes
    yes           other str     str with 8bit bytes from asciibytes
                                encoded as PEP 383 surrogateescape
                                (note: promotes latin1 to 2-byte-wide)
    no            whatever      whatever
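The operation table above can be sketched in pure Python. This is only a rough model under stated assumptions: the `AsciiBytes` class, its `raw` attribute, and `has_8bit()` are hypothetical names, no such codec or PEP 393 extension exists, and the sketch says nothing about the internal storage width.

```python
class AsciiBytes:
    """Hypothetical stand-in for the proposed "asciibytes" string kind."""

    def __init__(self, raw: bytes):
        self.raw = raw

    def has_8bit(self) -> bool:
        # True if any byte falls outside the ASCII range.
        return any(b >= 0x80 for b in self.raw)

    def __add__(self, other):
        # Row "asciibytes + asciibytes": the result stays asciibytes.
        if isinstance(other, AsciiBytes):
            return AsciiBytes(self.raw + other.raw)
        if not isinstance(other, str):
            return NotImplemented
        # Row "no 8bit": the result simply takes the other operand's type.
        if not self.has_8bit():
            return self.raw.decode("ascii") + other
        # Row "pure ascii str": the result stays asciibytes.
        # (str.isascii() requires Python 3.7 or later.)
        if other.isascii():
            return AsciiBytes(self.raw + other.encode("ascii"))
        # Row "other str": the result is a plain str, with each 8-bit byte
        # carried as a PEP 383 lone surrogate (U+DC80..U+DCFF).
        return self.raw.decode("ascii", errors="surrogateescape") + other


blob = AsciiBytes(b"caf\xe9")
with_ascii = blob + "!"                     # stays AsciiBytes
with_blob = blob + AsciiBytes(b"x")         # stays AsciiBytes
with_text = blob + "\u03a9"                 # becomes str with a surrogate
plain = AsciiBytes(b"abc") + "\u03a9"       # no 8-bit bytes: ordinary str
```

In the "other str" row the escaped bytes become code points above U+00FF, which is why the table notes that a latin-1 result would be promoted to the 2-byte-wide representation.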
I think Nick actually had a module that worked pretty much like this,
but he never pushed it. I've never had time to reason out the
possible failure modes, though, or the performance issues. And it's
not an itch I personally need to scratch.
I believe (but haven't proved) that the failure modes with the above
operation table are the same as for str containing PEP 383
surrogates. I'm not sure what other issues you might run into. Also,
I'm not sure it's reasonable to have an asciibytes with no 8bit bytes.
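For reference, one well-known failure mode of a str containing PEP 383 surrogates is shown below. Whether the asciibytes scheme would share exactly these failure modes is the unproved belief stated above; the sample bytes are an illustrative assumption.

```python
# Illustrative sample: "café" encoded as latin-1, so 0xE9 is not valid ASCII.
raw = b"caf\xe9"

# surrogateescape turns each undecodable byte into a lone surrogate.
text = raw.decode("ascii", errors="surrogateescape")

# A strict encoder rejects lone surrogates, so the error surfaces only
# when the string is finally written out, possibly far from its origin.
try:
    text.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Encoding with the same error handler recovers the original bytes.
recovered = text.encode("utf-8", errors="surrogateescape")
```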
Steve