[Python-Dev] When should pathlib stop being provisional?

Tue Apr 5 23:18:09 EDT 2016

On Wed, Apr 6, 2016 at 12:51 PM, Steven D'Aprano <steve at pearwood.info> wrote:
> On Wed, Apr 06, 2016 at 10:02:30AM +1000, Chris Angelico wrote:
>
>> My personal view on the text/bytes debate is that a path is
>> fundamentally a human concept, and consists therefore of text. The
>> fact that some file systems store (at the low level) bytes and some
>> store (I think) UTF-16 code units should be immaterial; path
>> components exist for people. We can smuggle unrecognized bytes around,
>> but ultimately, those bytes came from characters at some point - we
>> just don't know the encoding. So a Path object has no relationship
>> with bytes, only with str.
>
> That might be usually true in practice, but it is incorrect in
> principle. Paths in POSIX systems like Linux are fundamentally
> byte-strings with only two restrictions: \0 and \x2f are forbidden.

That's the file system level. But more fundamentally than that, a path
exists so that humans can refer to files. That's why they have
*names*, not just dirent numbers. We could assign dirent number -1 to
mean "parent directory", and then represent everything with tuples of
directory entries. Follow the chain and you get an inode. Absolute
paths would start with an inode (the root directory being inode 2) and
proceed with dirents thereafter. Maybe we'd need a pseudo-inode to
mean "current directory". Should we do paths like this? No way! Much
better to have either "/home/rosuav/cpython/python" or (P.ROOT,
"home", "rosuav", "cpython", "python") to represent them, because they
exist for the human.
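
As a rough illustration of the tuple-of-names form above, pathlib
already exposes something close to it through the `parts` attribute.
This is only a sketch: `P.ROOT` in my example is a made-up marker, and
pathlib uses the string "/" in that position instead.

```python
from pathlib import PurePosixPath

def components(path_text):
    # Split a POSIX-style path into its human-readable components.
    # PurePosixPath does no file system access, so this runs anywhere.
    return PurePosixPath(path_text).parts

print(components("/home/rosuav/cpython/python"))
```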

The POSIX file system rules aren't insignificant, but my point is that
every byte value seen in a file name once represented a character.
Outside of deliberate tests, we don't create files on our disks whose
names are strings of random bytes; the normal use of a file system is
to store files that a human has named. Hence my recommendation that a
Path object be tied to str, but *not* to bytes.
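
For what it's worth, pathlib as it stands already behaves this way: it
takes str and refuses bytes with a TypeError. A small sketch, where
`accepts` is just a helper name I made up for the demonstration:

```python
from pathlib import PurePosixPath

def accepts(arg):
    # Return True if pathlib is willing to build a path from arg.
    try:
        PurePosixPath(arg)
    except TypeError:
        return False
    return True

print(accepts("home/rosuav"))   # str is accepted
print(accepts(b"home/rosuav"))  # bytes is rejected
```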

> The fact that paths in Linux mostly happen to look like English words
> (often heavily abbreviated) is a historical accident. The file system
> itself supported paths containing (say) \xff even back in the days when
> text was pure US-ASCII and bytes over \x7f had no textual meaning, and
> these days paths still support sequences of bytes that have no human
> meaning in any encoding.
>
> I don't know if this makes the tiniest lick of difference for Pathlib. I
> would be perfectly content if we stuck with the design decision that
> Pathlib can only represent paths representable as Unicode strings, and
> left weird POSIX filenames to the legacy byte-string interface.

I'd prefer to keep the surrogateescape compatibility hack, with U+DC80
to U+DCFF being used to smuggle undecodable bytes around. That means
that every path can be represented as a Unicode string, with only minor
loss of functionality (imagine a path with only a single character that
can't be decoded - chances are a human can figure out what the file
is), but it still pushes strongly toward a Unicode interpretation of
the path.
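
To make the smuggling concrete, here is a minimal sketch using the
"surrogateescape" error handler directly. I'm assuming UTF-8 as the
file system encoding purely for the example, and the function names
`smuggle` and `unsmuggle` are my own, not anything in the stdlib.

```python
def smuggle(raw, encoding="utf-8"):
    # Decode a byte-string file name; any undecodable byte becomes a
    # lone surrogate code point instead of raising an error.
    return raw.decode(encoding, "surrogateescape")

def unsmuggle(name, encoding="utf-8"):
    # Reverse the process: lone surrogates turn back into the original
    # bytes, so the round trip is lossless.
    return name.encode(encoding, "surrogateescape")

raw = b"caf\xe9.txt"         # a Latin-1 name, which is not valid UTF-8
name = smuggle(raw)
print(ascii(name))           # the \xe9 byte shows up as \udce9
print(unsmuggle(name) == raw)
```

A human looking at that name can still guess what the file is, which is
the "minor loss of functionality" case.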

An *actual* byte-string interface (such as os.listdir and friends
support) would be completely outside of anything involving Pathlib. If
you give bytes, you'll get bytes. And I'd deprecate that once Path
objects are more broadly accepted.
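
The "give bytes, get bytes" behaviour I mean looks like this. It's a
sketch against a throwaway temporary directory; `list_both_ways` and
the file name "hello.txt" are invented for the example.

```python
import os
import tempfile

def list_both_ways():
    # Create a scratch directory containing one file, then list it
    # once with a str path and once with a bytes path.
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "hello.txt"), "w"):
            pass
        as_text = os.listdir(tmp)                # str in, str out
        as_bytes = os.listdir(os.fsencode(tmp))  # bytes in, bytes out
    return as_text, as_bytes

print(list_both_ways())
```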

ChrisA