Getting a bit OT, but I *think* this is the story:

I've heard it argued, by folks that want to write Python software that uses bytes for filenames, that:

A file path on a *nix system can be any string of bytes, except two special values:

b'\x00'   : null
b'\x2f'    : slash 

(consistent with this SO post, among many other sources: https://unix.stackexchange.com/questions/39175/understanding-unix-file-name-encoding)

So any encoding will work, as long as those two values mean the right thing. Practically, null is always null, so that leaves the slash.

So any encoding that uses b'\x2f' for the slash would seem to work. At first glance that includes, for instance, UTF-16 (setting aside, for the moment, the null bytes it also emits):

In [31]: "/".encode('utf-16')                                                  
Out[31]: b'\xff\xfe/\x00'

In [40]: [hex(b) for b in "/".encode('utf-16')]                                
Out[40]: ['0xff', '0xfe', '0x2f', '0x0']

However, if you actually used that in raw form and, for instance, split on the \x2f byte, you wouldn't get anything useful:

In [53]: first, second = "first_part/second_part".encode('utf-16').split(b'/')  

In [54]: first.decode('utf-16')                                                
Out[54]: 'first_part'

In [55]: second.decode('utf-16')                                                
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-55-9eec3a9ebb3d> in <module>
----> 1 second.decode('utf-16')

UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x00 in position 22: truncated data
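To see why the second half fails, it helps to look at the raw bytes. This is only a sketch of the same split, using 'utf-16-le' (my choice, so there is no BOM to count) rather than plain 'utf-16':

```python
# Re-run the split and look at the raw bytes of each half.
encoded = "first_part/second_part".encode("utf-16-le")  # no BOM, to keep the byte counts simple
first, second = encoded.split(b"/")

print(len(first), len(second))   # 20 23 (the second half has an odd length)
print(second[:3])                # b'\x00s\x00' (it starts with the orphaned high byte of "/")

# Dropping that stray byte realigns the 2-byte units, so decoding works again.
print(second[1:].decode("utf-16-le"))  # second_part
```

So the split did find the slash, but it left the other half of the slash's two-byte unit stuck to the front of the second piece.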

In practice, I suspect that every *nix system uses encoding(s) that are ASCII-compatible for the first 128 values (0-127), or more precisely that encode the slash as the single byte 0x2f. As long as that's the case (and that byte value doesn't show up inside any other character), software can know nothing about the encoding and still do two things:

  - pass it around
  - split on the slash

Which may be enough for various system tools. So that's a fine argument for facilitating the use of bytes in things like globbing directories, opening files, etc.
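Here is a rough check of that idea, not a survey of real systems. The helper name split_survives, the sample path, and the list of encodings are all just things I made up for illustration:

```python
def split_survives(path: str, encoding: str) -> bool:
    """Encode path, split the raw bytes on b'/', and report whether
    every piece still decodes on its own in the same encoding."""
    parts = path.encode(encoding).split(b"/")
    try:
        for part in parts:
            part.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# A sample path with non-ASCII characters (escapes used so the literal is unambiguous).
path = "donn\u00e9es/r\u00e9sum\u00e9.txt"

for encoding in ("utf-8", "latin-1", "cp1252", "utf-16-le"):
    slash_is_2f = "/".encode(encoding) == b"/"
    print(encoding, slash_is_2f, split_survives(path, encoding))
```

For the three ASCII-compatible encodings the slash is the single byte 0x2f and the pieces decode fine; for 'utf-16-le' neither is true.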

But as soon as you have any interaction with humans, filenames need to be meaningful to humans. And as soon as you manipulate the names beyond splitting and joining on a slash, you do need to know something about the encoding.

In practice, maybe knowing that it's ASCII-compatible for the first 128 values will get you pretty far, as you can do things like:

if filename_bytes.endswith(b'.txt'):
    root_name = filename_bytes[:-4]
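A slightly more general version of that check might look like the following. The function name strip_suffix is hypothetical, just a sketch of the behavior, not a proposal for the actual spelling:

```python
def strip_suffix(name: bytes, suffix: bytes) -> bytes:
    """Return name without suffix if it ends with it, else name unchanged."""
    # Guard against an empty suffix: name[:-0] would be empty, not name.
    if suffix and name.endswith(suffix):
        return name[:-len(suffix)]
    return name


print(strip_suffix(b"caf\xc3\xa9.txt", b".txt"))  # b'caf\xc3\xa9'
print(strip_suffix(b"notes.md", b".txt"))          # b'notes.md'
```

Note that this never looks at the non-ASCII bytes at all, which is why it works without knowing the encoding.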

So adding the new stripsuffix or whatever we call it makes sense. However:

As soon as someone wants to do anything even a bit more sophisticated that may involve non-ASCII characters, it would all go to heck.
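A couple of examples of what I mean, assuming (for illustration only) that the name happens to be UTF-8:

```python
name = "r\u00e9sum\u00e9.txt"   # "résumé.txt", written with escapes to keep the literal unambiguous
raw = name.encode("utf-8")

# bytes.upper() only touches ASCII letters; the two bytes of each "é" pass through unchanged.
print(raw.upper().decode("utf-8"))  # RéSUMé.TXT
print(name.upper())                 # RÉSUMÉ.TXT

# Slicing by byte count can also cut a multi-byte character in half.
print(raw[:2])  # b'r\xc3' (an incomplete UTF-8 sequence)
```

Neither operation raises an error on the bytes object; you just quietly get the wrong answer.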

And my understanding is that with the 'surrogateescape' error handler, you can decode using a "maybe right" encoding, manipulate the result as a str, and then encode it back using the same encoding.
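As I understand it, the round trip looks roughly like this. The sample bytes and the choice of UTF-8 as the "maybe right" guess are my own assumptions for the sketch:

```python
# Bytes that are NOT valid UTF-8 (this is "café.txt" in Latin-1).
raw = b"caf\xe9.txt"

# Decode with a "maybe right" encoding; undecodable bytes become lone surrogates.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))  # 'caf\udce9.txt'

# Manipulate as str, then encode back with the same codec and error handler.
renamed = text[:-len(".txt")] + ".bak"
print(renamed.encode("utf-8", errors="surrogateescape"))  # b'caf\xe9.bak'
```

The byte the guess got wrong comes back out exactly as it went in, as long as the manipulation leaves the escaped character alone.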

Though this still goes to heck if the encoding uses more than one byte for the slash (or if an escaped surrogate gets caught up in some other manipulation you do).
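Here is one way the second problem can show up, as a sketch only. I've deliberately picked a wrong guess (ASCII) for bytes that are really UTF-8:

```python
# Valid UTF-8 bytes for "é.txt", but decoded with the wrong guess (ASCII).
raw = "\u00e9.txt".encode("utf-8")          # b'\xc3\xa9.txt'
text = raw.decode("ascii", errors="surrogateescape")
print(len(text))  # 6 (the single "é" has become two escaped code points)

# A str-level edit that looks harmless (drop the first character) ...
trimmed = text[1:].encode("ascii", errors="surrogateescape")
print(trimmed)  # b'\xa9.txt'

# ... leaves a stray continuation byte that is no longer valid UTF-8.
try:
    trimmed.decode("utf-8")
except UnicodeDecodeError as exc:
    print("broken:", exc.reason)
```

The round trip itself is lossless, but any edit that splits or reorders the escaped pieces produces bytes that the real encoding can't read.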

Anyway -- this is why it seems like a bad idea to give the bytes object any more "string like" functionality.

But bytes has a pretty full set of "string like" methods now, so I suppose it makes sense to add a couple new ones that are related to ones that are already there.

-CHB


--
Christopher Barker, PhD

Python Language Consulting
  - Teaching
  - Scientific Software Development
  - Desktop GUI and Web Development
  - wxPython, numpy, scipy, Cython