On 07Mar2020 15:01, Christopher Barker <pythonchb@gmail.com> wrote:
On Fri, Mar 6, 2020 at 5:54 PM Guido van Rossum <guido@python.org> wrote:
(Since bytes may be used for file names I think they should get this new capability too.)
I don’t really care one way or another, but is it really still the case that bytes need to be used for filenames? For uses other than just passing them around?
Yes, Linux in particular does not guarantee that file names are using any particular encoding (let alone a consistent encoding for different files). The only two bytes that are special are '\0' and '/'.
I *think* I understand the issues. And I can see that some software would need to work with filenames as arbitrary bytes. But that doesn't mean that you can do much with them that way.
Given that the entire UNIX filename API is bytes, I think this isn't very true.
I can see filename.split(b'/') for instance, but how could you strip a prefix or suffix without knowing the encoding?
Well, directly:

    filename.cutsuffix(b'.abc')

But more seriously, you're either treating them as bytes with no particular encoding, in which case the above just means "remove these 4 bytes", or you do know the encoding and are working with strings. In the latter case you'd either have a string and cut a string, or have bytes and cut the value '.abc'.encode(encoding=known_encoding).

Things like listdir are dual mode: call it with a bytes directory name and you get bytes results, call it with a string directory name and you get string results. There's some funky encoding accommodation in there (read the docs; there's a small subtlety to do with returning strings which didn't decode cleanly from the underlying bytes).
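To illustrate the dual-mode behaviour, here is a minimal sketch. The scratch directory and the file name "notes.abc" are made up for the example; os.listdir, os.fsencode and tempfile are the standard library calls.

```python
import os
import tempfile

# Create a scratch directory holding one file, then list it both ways.
with tempfile.TemporaryDirectory() as tmpdir:
    open(os.path.join(tmpdir, "notes.abc"), "w").close()

    str_names = os.listdir(tmpdir)                 # str path in, str names out
    bytes_names = os.listdir(os.fsencode(tmpdir))  # bytes path in, bytes names out

print(str_names)    # ['notes.abc']
print(bytes_names)  # [b'notes.abc']
```

The type of the argument selects the type of the results, so code can stay entirely in bytes or entirely in strings.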
filename.strip_suffix(b'.txt') would only work for ASCII-compatible encodings.
Or b'.txt' is your known bytes encoding of some known string suffix in your working encoding. But like the other string-like bytes methods, I think there's a good case for supporting bytes prefixes and suffixes; it is just a matter of using the correct bytes affix in the regime you're working in. Might not be filenames, either.
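A rough sketch of both regimes, for illustration only. cut_suffix here is a hypothetical helper standing in for the method under discussion, and the file names and the choice of UTF-16-LE are assumptions picked to show a non-ASCII-compatible encoding.

```python
def cut_suffix(data: bytes, suffix: bytes) -> bytes:
    """Return data without suffix if it ends with it, else data unchanged."""
    if suffix and data.endswith(suffix):
        return data[:-len(suffix)]
    return data

# Regime 1: opaque bytes, no encoding assumed; drop exactly these 4 bytes.
raw_name = b"report\xff.txt"          # \xff is not valid UTF-8, and need not be
stem = cut_suffix(raw_name, b".txt")
print(stem)                           # b'report\xff'

# Regime 2: the encoding is known from the outer context, so the affix
# is the encoded form of a known string suffix.
known_encoding = "utf-16-le"
encoded_name = "report.txt".encode(known_encoding)
encoded_stem = cut_suffix(encoded_name, ".txt".encode(known_encoding))
print(encoded_stem.decode(known_encoding))  # report
```

In neither case does the helper itself need to know anything about encodings; the caller supplies the right bytes affix.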
There's no way around the fact that you have to make SOME assumptions about the encoding if you are going to do anything other than pass it around or work with the b'/' byte.
They needn't be assumptions; all code has some outer context.
And if that's the case, then you might as well decode and use 'surrogateescape' so the program won't crash.
Ah, I see you've encountered the listdir-return-string stuff already then.
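For anyone following along, a small sketch of the 'surrogateescape' round trip. The byte string is a made-up Latin-1 file name that is not valid UTF-8, and UTF-8 is assumed as the working encoding.

```python
raw = b"caf\xe9.txt"  # Latin-1 bytes; \xe9 followed by '.' is not valid UTF-8

# Undecodable bytes become lone surrogates instead of raising an error.
name = raw.decode("utf-8", errors="surrogateescape")
print(ascii(name))            # 'caf\udce9.txt'
print(name.endswith(".txt"))  # True

# Encoding with the same error handler restores the original bytes exactly.
round_tripped = name.encode("utf-8", errors="surrogateescape")
print(round_tripped == raw)   # True
```

The string methods work on the decoded name, and nothing is lost on the way back to bytes, though such a string cannot be encoded with the default strict handler.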
Getting OT, but I do wonder if we should continue to support (and therefore encourage) the use of bytes in inappropriate ways.
I think there's plenty of reasonable bytes actions which look a lot like string actions, and are not confusing. Consider this contrived example:

    payload_bytes = packet_bytes.cutprefix(header_bytes)

There was an interesting writeup by a guy involved in the Mercurial Python 3 port where he discusses the pain which came with the bytes type lacking a lot of the string support methods when Python 3 first came out. He suggests a lot of things would have gone far smoother with these, as Mercurial had a lot of filenames-as-bytes-strings inside. Here we are:

    https://gregoryszorc.com/blog/2020/01/13/mercurial%27s-journey-to-and-reflec...

Personally I lean the other way, and welcomed the initial lack of stringish methods as a good way to uncover bytes mistakenly used for strings. But I see his point.

Cheers,
Cameron Simpson <cs@cskk.id.au>