[Python-Dev] Bytes path support
chris.barker at noaa.gov
Fri Aug 22 00:30:20 CEST 2014
On Wed, Aug 20, 2014 at 9:52 PM, Cameron Simpson <cs at zip.com.au> wrote:
> On 20Aug2014 16:04, Chris Barker - NOAA Federal <chris.barker at noaa.gov>
> wrote:
>> So really, people treat them as
>> "bytes-in-some-arbitrary-encoding-where-at-least the-slash-character-(and
>> maybe a couple others)-is-ascii-compatible"
> As someone who fought long and hard in the surrogate-escape listdir()
> wars, and was won over once the scheme was thoroughly explained to me, I
> take issue with these assertions: they are bogus or misleading.
> Firstly, POSIX filenames _are_ just byte strings. The only forbidden
> character is the NUL byte, which terminates a C string, and the only
> special character is the slash, which separates pathname components.
so they are "just byte strings", oh, except that you can't have a null,
and the "slash" had better be code 47 (and vice versa). How is that
different than "bytes-in-some-arbitrary-encoding-where-at-least
the-slash-character-is-ascii-compatible"?
(sorry about the "maybe a couple others", I was too lazy to do my research
and be sure).
But my point is that python users want to be able to work with paths, and
paths on posix are not strictly strings with a clearly defined encoding,
but they are also not quite "just arbitrary bytes". So it would be nice if
we could have a pathlib that would work with these odd beasts. I've lost
track a bit as to whether the surrogate-escape solution allows this to all
work now. If it does, then great, sorry for the noise.
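For what it's worth, here is a minimal sketch of how I understand the
surrogate-escape round trip to work. The file name and the choice of
utf-8 as the decoding codec are my own assumptions for illustration; on
POSIX, os.fsdecode() and os.fsencode() do the same thing with whatever
the filesystem encoding happens to be.

```python
from pathlib import PurePosixPath

# A file name as raw bytes: "cafe" with a Latin-1 e-acute (0xE9).
# This is not valid UTF-8. (Hypothetical name, for illustration only.)
raw = b"caf\xe9.txt"

# surrogateescape maps the undecodable byte 0xE9 to the lone
# surrogate U+DCE9 instead of raising UnicodeDecodeError.
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", "surrogateescape")

# pathlib works on str, so the escaped name can be used as a component.
p = PurePosixPath("/tmp") / name
suffix = p.suffix
round_trip = str(p).encode("utf-8", "surrogateescape")

print(ascii(name), restored == raw, suffix, round_trip)
```

If that holds up, the "odd beasts" survive a trip through pathlib
unchanged, although the escaped name cannot be printed or encoded
without the same error handler.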
> Second, a bare low level program cannot do _much_ more than pass them
> around. It certainly can do things like compute their basename, or other
> path related operations.
only if you assume that pesky slash == 47 thing -- it's not much, but it's
not raw bytes either.
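To make that concrete, here is a rough sketch of a basename computed on
bytes alone. The helper basename_bytes and the sample path are
hypothetical, made up for illustration; posixpath.basename is the real
stdlib function, which also accepts bytes.

```python
import posixpath

SLASH = 47  # the one byte value the kernel treats as a separator

def basename_bytes(path: bytes) -> bytes:
    """Return everything after the last byte 47 (hypothetical helper)."""
    # rfind returns -1 when there is no separator, so the slice
    # then starts at 0 and the whole path is returned.
    return path[path.rfind(bytes([SLASH])) + 1:]

# A path whose components are not valid in any single text encoding.
raw = b"/home/u\xff\xfe/caf\xe9.txt"

manual = basename_bytes(raw)
stdlib = posixpath.basename(raw)

print(manual, stdlib, bytes([SLASH]))
```

No decoding happens anywhere, but the code only works because byte 47
is fixed as the separator, and 47 is what ASCII assigns to "/".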
> The "bytes in some arbitrary encoding where at least the slash character
> (and maybe a couple others) is ascii compatible" notion is completely bogus.
> There's only one special byte, the slash (code 47). There's no OS-level
> need that it or anything else be ASCII compatible. I think
> characterizations such as the one quoted are actively misleading.
code 47 == "slash" is ascii compatible -- where else did the 47 value come
from?
> I think we'd all agree it is nice to have a system where filenames are all
> Unicode, but since POSIX/UNIX predates it by decades it is a bit late to
> ignore the reality for such systems.
well, the community could have gone to "if you want anything other than
ascii, make it utf-8 -- but always, we're all a bunch of independent
But none of this is relevant -- systems in the wild do what they do --
clearly we all want Python to work with them as best it can.
> There's no _external_ "filesystem encoding" in the sense of something
> recorded in the filesystem that anyone can inspect. But there is the
> expressed locale settings, available at runtime to any program that cares
> to pay attention. It is a workable situation.
I haven't run into it, but it seems the folks that have don't think relying
on the locale setting is the least bit workable. If it were, we wouldn't be
having this discussion -- use the locale setting to decide how to decode
filenames -- done.
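As I understand the complaint, it is roughly this: the bytes on disk
don't record which encoding produced them, so the locale of whoever
reads them decides what name they see. A small sketch, with a made-up
file name and two encodings picked only as examples:

```python
import sys

# Bytes as stored in a directory entry (hypothetical name), written by
# a program running under a UTF-8 locale.
raw = b"caf\xc3\xa9"

# A reader in a UTF-8 locale sees the intended four characters.
as_utf8 = raw.decode("utf-8")

# A reader in a Latin-1 locale decodes the same bytes without any
# error, but gets five characters and a different name.
as_latin1 = raw.decode("latin-1")

# What this particular process would use for file names.
fs_encoding = sys.getfilesystemencoding()

print(ascii(as_utf8), ascii(as_latin1), fs_encoding)
```

Neither decode fails, which is the problem: the locale setting gives an
answer, just not necessarily the one the file's creator had in mind.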
> Oh, and I reject Nick's characterisation of POSIX as "broken". It's
> perfectly internally consistent. It just doesn't match what he wants.
> (Indeed, what I want, and I'm a long time UNIX fanboy.)
bug or feature? you decide. Internal consistency is a good start, but it
punts the whole encoding issue to the client software, without giving it
the tools to do it right. I call that "really hard to work with" if not
Christopher Barker, Ph.D.
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov