[Python-ideas] prefix/suffix for bytes (was: New explicit methods to trim strings)

8 Mar 2020

On 07Mar2020 15:01, Christopher Barker <pythonchb@gmail.com> wrote:
...
On Fri, Mar 6, 2020 at 5:54 PM Guido van Rossum <guido@python.org> wrote:
...
(Since bytes may be used for file names I think they should get this 
new capability too.)
...
I don’t really care one way or another, but is it really still the 
case that bytes need to be used for filenames? For uses other than just passing
them around?
Yes, Linux in particular does not guarantee that file names are using any
particular encoding (let alone a consistent encoding for different files).
The only two bytes that are special are '\0' and '/'.
I *think* I understand the issues. And I can see that some software would
need to work with filenames as arbitrary bytes. But that doesn't mean that
you can do much with them that way.
Given that the entire UNIX filename API is bytes, I think this isn't 
very true.
...
I can see filename.split(b'/') for instance, but how could you strip a
prefix or suffix without knowing the encoding?
Well, directly:

    filename.cutsuffix(b'.abc')

But more seriously, you're either treating them as bytes with no 
particular encoding and the above just means "remove these 4 bytes" or 
you do know the encoding and are working with strings, so you'd either 
have a string and cut a string, or have bytes and cut the value 
'.abc'.encode(encoding=known_encoding).
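
To make those two cases concrete, here is a minimal sketch. The 
cut_suffix helper is a hypothetical stand-in for the proposed method 
(its name was still under discussion), and the file names and the 
known_encoding value are made up for illustration.

```python
def cut_suffix(value: bytes, suffix: bytes) -> bytes:
    """Return value without a trailing suffix; unchanged if it is absent."""
    # The `suffix and` guard avoids value[:-0], which would be empty.
    if suffix and value.endswith(suffix):
        return value[:-len(suffix)]
    return value

# Case 1: bytes with no particular encoding; just "remove these 4 bytes".
raw_name = b'report\xff.abc'          # not valid UTF-8, and that is fine
stem = cut_suffix(raw_name, b'.abc')

# Case 2: the encoding is known, so encode the string suffix with it.
known_encoding = 'utf-16-le'          # assumed for illustration
encoded_name = 'report.abc'.encode(known_encoding)
encoded_stem = cut_suffix(encoded_name, '.abc'.encode(known_encoding))
```

In neither case does the helper itself need to know anything about the 
encoding; that knowledge lives in whoever picks the suffix bytes.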

Things like listdir are dual mode: call it with a bytes directory name 
and you get bytes results, call it with a string directory name and you 
get string results. There's some funky encoding accommodation in there 
(read the docs; there's a subtlety to do with returning strings 
which didn't decode cleanly from the underlying bytes).
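
A small sketch of that dual-mode behaviour, using a temporary directory 
and a made-up file name so that it is safe to run anywhere:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as dirpath:
    # Create one empty file so there is something to list.
    with open(os.path.join(dirpath, 'example.txt'), 'w'):
        pass

    # A str directory name gives str results.
    str_names = os.listdir(dirpath)

    # A bytes directory name gives bytes results.
    bytes_names = os.listdir(os.fsencode(dirpath))
```

os.fsencode and os.fsdecode convert between the two forms using the 
filesystem encoding and its error handler, which is where the encoding 
accommodation mentioned above comes in.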
...
filename.strip_suffix(b'.txt') would only work for ASCII-compatible
encodings.
Or b'.txt' is your known bytes encoding of some known string suffix in 
your working encoding.

But like the other string-like bytes methods, I think there's a good 
case for supporting bytes prefixes and suffixes; it is just a matter of 
using the correct bytes affix in the regime you're working in. Might not 
be filenames, either.
...
There's no way around the fact that you have to make SOME
assumptions about the encoding if you are going to do anything other than
pass it around or work with the b'/' byte.
They needn't be assumptions; all code has some outer context.
...
And if that's the case, then you
might as well decode and use 'surrogateescape' so the program won't crash.
Ah, I see you've encountered the listdir-return-string stuff already 
then.
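
For reference, a sketch of the decode-with-surrogateescape approach 
being described. The file name is invented, and UTF-8 is an assumed 
working encoding; the point is only that undecodable bytes survive a 
round trip through str.

```python
# A Latin-1 encoded name: the byte 0xE9 is not valid UTF-8 here.
raw_name = b'caf\xe9.txt'

# Decoding does not raise; the bad byte becomes a lone surrogate.
text_name = raw_name.decode('utf-8', errors='surrogateescape')

# Ordinary string operations work on the decoded name.
if text_name.endswith('.txt'):
    stem = text_name[:-len('.txt')]
else:
    stem = text_name

# Encoding with the same handler restores the original byte.
round_tripped = stem.encode('utf-8', errors='surrogateescape')
```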
...
Getting OT, but I do wonder if we should continue to support (and therefore
encourage) the use of bytes in inappropriate ways.
I think there's plenty of reasonable bytes actions which look a lot like 
string actions, and are not confusing. Consider this contrived example:

    payload_bytes = packet_bytes.cutprefix(header_bytes)
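
Spelled out as a runnable sketch, with a hypothetical cut_prefix helper 
standing in for the proposed method and an invented header value:

```python
def cut_prefix(data: bytes, prefix: bytes) -> bytes:
    """Return data without a leading prefix; unchanged if it is absent."""
    if data.startswith(prefix):
        return data[len(prefix):]
    return data

header_bytes = b'\x01\x02HDR'              # made-up framing header
packet_bytes = header_bytes + b'payload'

payload_bytes = cut_prefix(packet_bytes, header_bytes)

# Data that lacks the header is passed through untouched.
unframed = cut_prefix(b'payload only', header_bytes)
```

Nothing here depends on any text encoding; the header is just a known 
run of bytes.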

There was an interesting writeup by a guy involved in the Mercurial 
Python 3 port where he discusses the pain which came with the bytes type 
lacking a lot of the string support methods when Python 3 first came 
out. He suggests a lot of things would have gone far smoother with 
these, as Mercurial had a lot of filenames-as-bytes-strings inside. Here 
we are:

    https://gregoryszorc.com/blog/2020/01/13/mercurial%27s-journey-to-and-reflec...

Personally I lean the other way, and welcomed the initial lack of 
stringish methods as a good way to uncover bytes mistakenly used for 
strings. But I see his point.

Cheers,
Cameron Simpson <cs@cskk.id.au>