[Python-Dev] Bytes path support

Thu Aug 21 06:52:19 CEST 2014

On 20Aug2014 16:04, Chris Barker - NOAA Federal <chris.barker at noaa.gov> wrote:
>> but disallowing them in higher level
>> explicitly cross platform abstractions like pathlib.
>>
>I think the trick here is that posix-using folks claim that filenames are
>just bytes, and indeed they can be passed around with a char*, so they seem
>to be.
>
>but you can't possibly do anything other than pass them around if you
>REALLY think they are just bytes.
>
>So really, people treat them as
>"bytes-in-some-arbitrary-encoding-where-at-least the-slash-character-(and
>maybe a couple others)-is-ascii-compatible"

As someone who fought long and hard in the surrogate-escape listdir() wars, and 
was won over once the scheme was thoroughly explained to me, I take issue with 
these assertions: they are bogus or misleading.

Firstly, POSIX filenames _are_ just byte strings. The only forbidden character 
is the NUL byte, which terminates a C string, and the only special character is 
the slash, which separates pathname components.

Secondly, a bare low-level program cannot do _much_ more than pass them around, 
but it certainly can do things like compute their basename, or other 
path-related operations.
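
To illustrate, here is a minimal sketch of path operations done purely on 
bytes, with no encoding in play. The filename is a made-up example containing 
bytes that are not valid UTF-8, and posixpath is used directly so the POSIX 
rules apply whatever platform runs it:

```python
import posixpath

# Hypothetical filename; 0xff and 0xfe never occur in valid UTF-8.
raw = b"/var/tmp/report-\xff\xfe.dat"

# All of these work on the bytes alone; only the slash (and, for
# splitext, the dot) is treated specially.
base = posixpath.basename(raw)
parent = posixpath.dirname(raw)
stem, ext = posixpath.splitext(base)

# Splitting on the single special byte (code 47) gives the components.
components = raw.split(b"/")
```

Nothing here needs to know, or guess, what the other bytes mean.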

The "bytes in some arbitrary encoding where at least the slash character (and
maybe a couple others) is ascii compatible" notion is completely bogus. There's 
only one special byte, the slash (code 47). There's no OS-level need that it or 
anything else be ASCII compatible. I think characterisations such as the one 
quoted are actively misleading.

The way you get UTF-8 (or some other encoding, fortunately getting less and 
less common) is by convention: you decide in your environment to work in some 
encoding (say utf-8) via the locale variables, and all your user-facing text 
gets used in UTF-8 encoding form when turned into bytes for the filename calls 
because your text<->bytes methods say to do so.
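
A rough sketch of what "by convention" means in practice: the same 
user-facing name becomes different bytes depending on which encoding your 
environment has chosen. The name and the two encodings below are just 
illustrative choices, not anything the OS mandates:

```python
# The same user-facing text ("cafe" with an e-acute, plus ".txt").
name = "caf\u00e9.txt"

# Bytes handed to the filename calls under two different conventions.
as_utf8 = name.encode("utf-8")
as_latin1 = name.encode("latin-1")

# A program assuming UTF-8 cannot strictly decode the Latin-1 form.
try:
    as_latin1.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
```

The kernel stores either byte string happily; only the programs reading the 
name care which convention produced it.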

I think we'd all agree it is nice to have a system where filenames are all 
Unicode, but since POSIX/UNIX predates it by decades it is a bit late to ignore 
the reality for such systems. I certainly think the Windows-side Babel of code 
pages and multiple code systems is far, far worse. (Disclaimer: not a Windows 
programmer, just based on hearing them complain.)

I'm +1000 on systems where the filesystem enforces Unicode (e.g. Plan 9 or Mac 
OS X, which forces a specific UTF-8 encoding in the bytes POSIX APIs - the 
underlying filesystems reject invalid byte sequences).

[...]
> Antoine Pitrou wrote:
>> To elaborate specifically about pathlib, it doesn't handle bytes paths
>> but allows you to generate them if desired:
>> https://docs.python.org/3/library/pathlib.html#operators
>
>but that uses
>os.fsencode:  Encode filename to the filesystem encoding
>
>As I understand it, the whole problem with some posix systems is that there
>is NO filesystem encoding -- i.e. you can't know for sure what encoding a
>filename is in. So you need to be able to pass the bytes through as they
>are.

Yes and no. I made that argument too.

There's no _external_ "filesystem encoding" in the sense of something recorded 
in the filesystem that anyone can inspect. But there are the expressed locale 
settings, available at runtime to any program that cares to pay attention. It 
is a workable situation.
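
As a sketch of why it is workable, here is the surrogate-escape round trip 
spelled out with explicit codec arguments. I am assuming a UTF-8 locale for 
the example; on POSIX, os.fsdecode() and os.fsencode() do the equivalent 
using whatever encoding the locale settings express:

```python
import sys

# Bytes as a low-level listdir() might return them; 0xe9 is Latin-1
# e-acute and is not valid UTF-8 in this position.
raw = b"caf\xe9.txt"

# Decode under the assumed UTF-8 convention. The undecodable byte
# becomes a lone surrogate (U+DCE9) instead of raising an error.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", "surrogateescape")

# The encoding Python derived from the runtime environment.
fs_encoding = sys.getfilesystemencoding()
```

So a name that does not match the locale still survives a trip through str 
and back, which is all a program passing names around needs.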

Oh, and I reject Nick's characterisation of POSIX as "broken". It's perfectly 
internally consistent. It just doesn't match what he wants. (Indeed, what I 
want, and I'm a long time UNIX fanboy.)

Cheers,
Cameron Simpson <cs at zip.com.au>

God is real, unless declared integer.   - Johan Montald, johan at ingres.com