[Python-Dev] When should pathlib stop being provisional?
steve at pearwood.info
Tue Apr 5 22:51:55 EDT 2016
On Wed, Apr 06, 2016 at 10:02:30AM +1000, Chris Angelico wrote:
> My personal view on the text/bytes debate is that a path is
> fundamentally a human concept, and consists therefore of text. The
> fact that some file systems store (at the low level) bytes and some
> store (I think) UTF-16 code units should be immaterial; path
> components exist for people. We can smuggle unrecognized bytes around,
> but ultimately, those bytes came from characters at some point - we
> just don't know the encoding. So a Path object has no relationship
> with bytes, only with str.
That might usually be true in practice, but it is incorrect in
principle. Paths in POSIX systems like Linux are fundamentally
byte-strings with only two restrictions: \0 is forbidden, and \x2f
(the "/" separator) cannot appear inside a path component.
The fact that paths in Linux mostly happen to look like English words
(often heavily abbreviated) is a historical accident. The file system
itself supported paths containing (say) \xff even back in the days when
text was pure US-ASCII and bytes over \x7f had no textual meaning, and
these days paths still support sequences of bytes that have no human
meaning in any encoding.
I don't know if this makes the tiniest lick of difference for Pathlib. I
would be perfectly content if we stuck with the design decision that
Pathlib can only represent paths representable as Unicode strings, and
left weird POSIX filenames to the legacy byte-string interface.
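
To make the trade-off concrete, here is a minimal sketch of how a
non-text POSIX filename can be carried through a str-only path object
and recovered unchanged. It assumes a UTF-8 filesystem encoding and
uses the "surrogateescape" error handler explicitly. The helper names
bytes_to_path and path_to_bytes are hypothetical, not part of pathlib.

```python
from pathlib import PurePosixPath


def bytes_to_path(raw: bytes) -> PurePosixPath:
    # Hypothetical helper: undecodable bytes become lone surrogates
    # (U+DC80..U+DCFF) instead of raising UnicodeDecodeError.
    return PurePosixPath(raw.decode("utf-8", "surrogateescape"))


def path_to_bytes(path: PurePosixPath) -> bytes:
    # Hypothetical helper: the reverse mapping restores the original bytes.
    return str(path).encode("utf-8", "surrogateescape")


raw = b"/tmp/caf\xff.txt"        # \xff is never valid in UTF-8
path = bytes_to_path(raw)

# The round trip is lossless even though the name is not real text.
assert path_to_bytes(path) == raw

# The smuggled byte is not encodable as ordinary text.
try:
    str(path).encode("utf-8")
    strictly_encodable = True
except UnicodeEncodeError:
    strictly_encodable = False

# pathlib itself refuses bytes arguments outright.
try:
    PurePosixPath(raw)
    accepts_bytes = True
except TypeError:
    accepts_bytes = False
```

The str form round-trips, but it is only text in a nominal sense: it
cannot be encoded strictly, so printing or logging it may fail.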