[Python-Dev] Bytes path support
cs at zip.com.au
Thu Aug 21 06:52:19 CEST 2014
On 20Aug2014 16:04, Chris Barker - NOAA Federal <chris.barker at noaa.gov> wrote:
>> but disallowing them in higher level
>>> > explicitly cross platform abstractions like pathlib.
>I think the trick here is that posix-using folks claim that filenames are
>just bytes, and indeed they can be passed around with a char*, so they seem
>but you can't possibly do anything other than pass them around if you
>REALLY think they are just bytes.
>So really, people treat them as
>maybe a couple others)-is-ascii-compatible"
As someone who fought long and hard in the surrogate-escape listdir() wars, and
was won over once the scheme was thoroughly explained to me, I take issue with
these assertions: they are bogus or misleading.
Firstly, POSIX filenames _are_ just byte strings. The only forbidden character
is the NUL byte, which terminates a C string, and the only special character is
the slash, which separates pathname components.
Second, a bare low level program cannot do _much_ more than pass them around.
It certainly can do things like compute their basename, or other path related
operations.
The "bytes in some arbitrary encoding where at least the slash character (and
maybe a couple others) is ascii compatible" notion is completely bogus. There's
only one special byte, the slash (code 47). There's no OS-level need that it or
anything else be ASCII compatible. I think characterisations such as the one
quoted are actively misleading.
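
To make that concrete, here is a minimal sketch of path manipulation done
purely on bytes, with no encoding assumed anywhere. The path is made up for
illustration; its 0xE9 bytes are deliberately not valid UTF-8, and the only
byte the code treats as special is the slash.

```python
import posixpath

# Hypothetical filename: the 0xE9 bytes do not form valid UTF-8.
raw_path = b"/srv/data/r\xe9sum\xe9.txt"

# posixpath accepts bytes and only ever looks for the slash byte.
parent, name = posixpath.split(raw_path)

# The same split by hand, again without decoding anything.
components = raw_path.split(b"/")

# The separator is a single byte with code 47.
slash_code = b"/"[0]
```

Nothing here needs to know what the other bytes mean; they are carried through
untouched.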
The way you get UTF-8 (or some other encoding, fortunately getting less and
less common) is by convention: you decide in your environment to work in some
encoding (say utf-8) via the locale variables, and all your user-facing text
gets used in UTF-8 encoding form when turned into bytes for the filename calls
because your text<->bytes methods say to do so.
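
A rough sketch of that convention, and of how the surrogate-escape scheme
copes with names that break it. The assumptions are mine: UTF-8 is hardcoded
here to stand in for whatever the locale selects, and the helper names
name_to_bytes and bytes_to_name are hypothetical, not library functions.

```python
# Assumption: UTF-8 stands in for the encoding chosen via the locale.
ENCODING = "utf-8"


def name_to_bytes(name: str) -> bytes:
    """Turn user-facing text into the bytes handed to the filename calls."""
    return name.encode(ENCODING, "surrogateescape")


def bytes_to_name(raw: bytes) -> str:
    """Turn filename bytes into text, smuggling undecodable bytes through."""
    return raw.decode(ENCODING, "surrogateescape")


# Text that follows the convention becomes ordinary UTF-8 bytes.
utf8_bytes = name_to_bytes("caf\u00e9.txt")

# A name written under some other encoding (Latin-1 here) still round-trips:
# the stray 0xE9 byte becomes the lone surrogate U+DCE9 and back again.
legacy = b"caf\xe9.txt"
as_text = bytes_to_name(legacy)
round_tripped = name_to_bytes(as_text)
```

The text form of the legacy name is not pretty to display, but no information
is lost, which is the property that matters when handing the name back to the
OS.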
I think we'd all agree it is nice to have a system where filenames are all
Unicode, but since POSIX/UNIX predates it by decades it is a bit late to ignore
the reality for such systems. I certainly think the Windows-side Babel of code
pages and multiple code systems is far far worse. (Disclaimer: not a Windows
programmer, just based on hearing them complain.)
I'm +1000 on systems where the filesystem enforces Unicode (eg Plan 9 or Mac
OSX, which forces a specific UTF-8 encoding in the bytes POSIX APIs - the
underlying filesystems reject invalid byte sequences).
> Antoine Pitrou wrote:
>> To elaborate specifically about pathlib, it doesn't handle bytes paths
>> but allows you to generate them if desired:
>but that uses
>os.fsencode: Encode filename to the filesystem encoding
>As I understand it, the whole problem with some posix systems is that there
>is NO filesystem encoding -- i.e. you can't know for sure what encoding a
>filename is in. So you need to be able to pass the bytes through as they
Yes and no. I made that argument too.
There's no _external_ "filesystem encoding" in the sense of something recorded
in the filesystem that anyone can inspect. But there are the expressed locale
settings, available at runtime to any program that cares to pay attention. It
is a workable situation.
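
As a sketch of what "paying attention" can look like from Python, the snippet
below asks the runtime which encoding and error handler it derived for
filenames and then uses os.fsencode and os.fsdecode, which apply them. The
function name filename_encoding_info and the sample names are hypothetical;
the reported values depend on the environment, so none are shown.

```python
import locale
import os
import sys
from pathlib import PurePosixPath


def filename_encoding_info() -> dict:
    """Report what the runtime will use for text<->bytes on filenames."""
    return {
        "filesystem_encoding": sys.getfilesystemencoding(),
        "filesystem_errors": sys.getfilesystemencodeerrors(),
        "preferred_encoding": locale.getpreferredencoding(False),
    }


info = filename_encoding_info()

# os.fsencode/os.fsdecode apply the encoding and error handler above.
ascii_name = "notes.txt"
encoded = os.fsencode(ascii_name)
decoded = os.fsdecode(encoded)

# pathlib produces bytes the same way, via os.fsencode on the string form.
path_bytes = bytes(PurePosixPath("docs") / ascii_name)
```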
Oh, and I reject Nick's characterisation of POSIX as "broken". It's perfectly
internally consistent. It just doesn't match what he wants. (Indeed, what I
want, and I'm a long time UNIX fanboy.)
Cameron Simpson <cs at zip.com.au>
God is real, unless declared integer. - Johan Montald, johan at ingres.com