pathlib - current status of discussions

name:
----
We are down to two choices:
- __fspath__, or
- __fspathname__

The final choice I suspect will be affected by the choice to allow (or not) bytes.

method or attribute:
-------------------
method

built-in:
--------
Almost - we'll put it in the os module

add to str:
----------
No, not all strings are paths.

add to C API:
------------
Yes. Possible names include PyUnicode_FromFSPath and PyObject_Path -- again, the choice of bytes inclusion will affect the final choice of name.

add a Path ABC:
--------------
undecided

Sticking points:
---------------
Do we allow bytes to be returned from os.fspath()? If yes, then do we allow bytes from __fspath__()?

--
~Ethan~

On Apr 11, 2016, at 5:58 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
name: ----
We are down to two choices:
- __fspath__, or - __fspathname__
The final choice I suspect will be affected by the choice to allow (or not) bytes.
+1 on __fspath__, -0 on __fspathname__
add a Path ABC: --------------
undecided
I think it makes sense to add it, but maybe only in 3.6? Path accepting code could be updated to do something like `isinstance(obj, (bytes, str, PathMeta))` which seems like a net win to me.
Sticking points: ---------------
Do we allow bytes to be returned from os.fspath()? If yes, then do we allow bytes from __fspath__()?
I think yes and yes, it seems like making it needlessly harder to deal with a bytes path in the scenarios that you’re actually dealing with them is the kind of change that 3.0 made that ended up getting rolled back where it could. ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

Donald Stufft writes:
I think yes and yes [__fspath__ and fspath should be allowed to handle bytes, otherwise] it seems like making it needlessly harder to deal with a bytes path
It's not needless. This kind of polymorphism makes it hard to review code locally. Once bytes get a foothold inside a text application, they metastasize altogether too easily, and you end up with TypeErrors or UnicodeErrors quite far from the origin. Debugging often requires tracing data flows over hill and over dale while choking from the dusty trail, or band-aids like a top-level "except UnicodeError: log_and_quarantine(bytes)". I can't prove that returning bytes from these APIs is a big risk in this sense, but I can't see a way to prove that it's not, either, given that their point is duck-typing, and therefore they may be generalized in the future, and by third parties. I understand that there are applications where it's bytes all the way down, but by the very nature of computing systems, there are systems where bytes are decoded to text. For historical reasons (the encoding Tower of Babel), it's very error-prone to do that on demand. Best practice is to do the conversion as close to the boundary as possible, and process only text internally. In text applications, "bytes as carcinogen" is an apt metaphor. Now, I'm not Dutch, so I can't tell you it's obvious that the risk to text-processing applications is more important than the inconvenience to byte-shoveling applications. But there is a need to be parsimonious with polymorphism.

On 12 April 2016 at 06:28, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Donald Stufft writes:
I think yes and yes [__fspath__ and fspath should be allowed to handle bytes, otherwise] it seems like making it needlessly harder to deal with a bytes path
It's not needless. This kind of polymorphism makes it hard to review code locally. Once bytes get a foothold inside a text application, they metastasize altogether too easily, and you end up with TypeErrors or UnicodeErrors quite far from the origin. Debugging often requires tracing data flows over hill and over dale while choking from the dusty trail, or band-aids like a top-level "except UnicodeError: log_and_quarantine(bytes)". I can't prove that returning bytes from these APIs is a big risk in this sense, but I can't see a way to prove that it's not, either, given that their point is duck-typing, and therefore they may be generalized in the future, and by third parties.
I understand that there are applications where it's bytes all the way down, but by the very nature of computing systems, there are systems where bytes are decoded to text. For historical reasons (the encoding Tower of Babel), it's very error-prone to do that on demand. Best practice is to do the conversion as close to the boundary as possible, and process only text internally.
In text applications, "bytes as carcinogen" is an apt metaphor.
Now, I'm not Dutch, so I can't tell you it's obvious that the risk to text-processing applications is more important than the inconvenience to byte-shoveling applications. But there is a need to be parsimonious with polymorphism.
As someone who has done a lot of work helping projects to port from the 2.x bytes/text model to the 3.x model, I have similar concerns that rooting out the source of bytes objects appearing in a program could be an issue with the proposed "return either" approach. The most effective tool I have found in fixing programs with text/bytes issues is carefully and thoroughly annotating precisely which functions accept and return bytes, and which accept and return text. The sort of mixed-mode processing we're talking about here makes that substantially harder. And note that the signature of os.fspath can return bytes or text *independent* of the type of the argument - it's not a "bytes in, bytes out" function like the usual pattern of "polymorphic support for bytes". But just like Stephen, I have no feel for how significant the risk will be in real life. I've never worked on code that actually has a need for bytestring paths (particularly now that surrogateescape ensures that most cases "just work"). Paul

On 12 April 2016 at 15:28, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Donald Stufft writes:
I think yes and yes [__fspath__ and fspath should be allowed to handle bytes, otherwise] it seems like making it needlessly harder to deal with a bytes path
It's not needless. This kind of polymorphism makes it hard to review code locally. Once bytes get a foothold inside a text application, they metastasize altogether too easily, and you end up with TypeErrors or UnicodeErrors quite far from the origin. Debugging often requires tracing data flows over hill and over dale while choking from the dusty trail, or band-aids like a top-level "except UnicodeError: log_and_quarantine(bytes)". I can't prove that returning bytes from these APIs is a big risk in this sense, but I can't see a way to prove that it's not, either, given that their point is duck-typing, and therefore they may be generalized in the future, and by third parties.
I understand that there are applications where it's bytes all the way down, but by the very nature of computing systems, there are systems where bytes are decoded to text. For historical reasons (the encoding Tower of Babel), it's very error-prone to do that on demand. Best practice is to do the conversion as close to the boundary as possible, and process only text internally.
One possible way to address this concern would be to have the underlying protocol be bytes/str (since boundary code frequently needs to handle the paths-are-bytes assumption in POSIX), but offer an "os.fspathname" API that rejected bytes output from os.fspath. That is, it would be equivalent to:

    def fspathname(path):
        name = os.fspath(path)
        if not isinstance(name, str):
            raise TypeError("Expected str for pathname, not {}".format(type(name)))
        return name

That way folks that wanted the clean "must be str" signature could use os.fspathname, while those that wanted to accept either could use the lower level os.fspath. The ambiguity in question here is inherent in the differences between the way POSIX and Windows work, so there are limits to how far we can go in hiding it without making things worse rather than better.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Nick Coghlan writes:
One possible way to address this concern would be to have the underlying protocol be bytes/str (since boundary code frequently needs to handle the paths-are-bytes assumption in POSIX),
What "needs"? As has been pointed out several times, with PEP 383 you can deal with bytes losslessly by using an arbitrary codec and errors=surrogateescape. I know why *I* use bytes nevertheless: because when I must guess the encoding, it just makes more sense to read bytes and then iterate over codecs until the result looks like words I know in some language. I don't understand why people who mostly believe "bytes are text, too" because almost all they ever see are bytes in the range 0x00-0x7f need bytes. For them, fsdecode and fsencode DTRT. If you want to claim "efficiency", I can't gainsay since I don't know the applications, but if you're trying to manipulate file names millions of times per second, I have to wonder what you're doing with them that benefits so much from Path.
but offer an "os.fspathname" API that rejected bytes output from os.fspath.
Either it's a YAGNI because I'm not going to get any bytes in the first place, or it raises where I probably could have done something useful with bytes if I were expecting them (see "pathological" below).
That way folks that wanted the clean "must be str" signature
Er, I don't need no steenkin' "clean signature". I need str, and if I can't get it from __fspath__, there's always os.fsdecode. But this is serious horse-before cart-putting, punishing those who do things Python-3-ishly right.
The ambiguity in question here is inherent in the differences between the way POSIX and Windows work,
Not with PEP 383, it's not. And I don't do Windows, so my preference for str has nothing to do with it mapping to native OS APIs well. The ambiguity in question here is inherent in the differences between the ways Python 2 and Python 3 programmers work on POSIX AFAICS. Certainly, there will be times when fsdecode doesn't DTRT. So those times you have to use an explicit bytes.decode. Note that when you *do* care enough to do that, it's because the Path is *text* -- you're going to display it to a human, or pass it out of the module. If all you're going to do is access the filesystem object denoted, fsdecode does a sufficiently accurate job. So if for some reason you're getting bytes at the boundary, I see no reason why you can't have a convenience constructor

    def pathological(str_or_bytes_or_path_seq):
        args = []
        for s_o_b in str_or_bytes_or_path_seq:
            # decode bytes segments; pass str and Path objects through unchanged
            args.append(os.fsdecode(s_o_b) if isinstance(s_o_b, bytes) else s_o_b)
        return pathlib.Path(*args)

for when that's good enough (maybe Antoine would even allow it into pathlib?)
so there are limits to how far we can go in hiding it without making things worse rather than better.
What "hide"? Nobody is suggesting that the polymorphic os APIs should go away. Indeed, they are perfect TOOWTDI, giving the programmer exactly the flexibility needed *and no more*, *at* the boundary. The questions on my mind are: (A) Why does anybody need bytes out of a pathlib.Path (or other __fspath__-toting, higher-level API) *inside* the boundary? Note that the APIs in os (etc) *don't need* bytes because they are already polymorphic. (B) If they do, why can't they just apply bytes() to the object? I understand that that would offend Ethan's aesthetic sense, so it's worth looking for a nice way around it. But allowing __fspath__ to return bytes or str is hideous, because Paths are clearly on the application side of the boundary. Note that bytes() may not have the serious problem that str() does of being too catholic about its argument: nothing in __builtins__ has a __bytes__! Of course there are a few things that do work: ints, and sequences of ints.
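The remark about bytes() being pickier than str() can be checked directly. The following is only a small illustration of the built-in constructors; the variable names are made up for the example.

```python
# bytes() accepts an int (that many zero bytes) or an iterable of ints.
zeros = bytes(3)
from_ints = bytes([80, 97, 116, 104])

# Unlike str(), it refuses text unless an encoding is given explicitly.
try:
    bytes("some/path")
    text_accepted = True
except TypeError:
    text_accepted = False

# str(), by contrast, converts unrelated objects without complaint.
as_text = str(3)
```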

On Tue, Apr 12, 2016 at 6:52 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
(A) Why does anybody need bytes out of a pathlib.Path (or other __fspath__-toting, higher-level API) *inside* the boundary? Note that the APIs in os (etc) *don't need* bytes because they are already polymorphic.
Indeed not from pathlib.*Path, but from DirEntry, which may have a path as bytes. So the options for DirEntry (or things like Ethan's 'antipathy') are:

(1) Provide bytes or str via the protocol, depending on which type this DirEntry has
    Downside: The protocol needs to support str and bytes.
(2) Decode bytes using os.fsdecode and provide a str via the protocol
    Downside: The user passed in bytes and maybe had a reason to do so. This might lead to a weird mixture of str and bytes in the same code.
(3) Do not implement the protocol when dealing with bytes
    Downside: If a function calling os.scandir accepts both bytes and str in a duck-typing fashion, then, if this adopted something that uses the new protocol, it will lose its bytes compatibility. This risk might not be huge, so perhaps (3) is an option?
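For readers who have not run into bytes-valued DirEntry objects, here is a minimal sketch of where they come from: os.scandir mirrors the type of its argument. The helper name entry_types and the file name are invented for the illustration, and a Python version where os.scandir works as a context manager is assumed.

```python
import os
import tempfile

def entry_types(directory):
    """Return the set of types used for DirEntry.path under `directory`."""
    with os.scandir(directory) as entries:
        return {type(entry.path) for entry in entries}

with tempfile.TemporaryDirectory() as tmp:
    # Create one file so that the directory is not empty.
    open(os.path.join(tmp, "example.txt"), "w").close()
    str_types = entry_types(tmp)                 # str in, str out
    bytes_types = entry_types(os.fsencode(tmp))  # bytes in, bytes out
```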
(B) If they do, why can't they just apply bytes() to the object? I understand that that would offend Ethan's aesthetic sense, so it's worth looking for a nice way around it. But allowing __fspath__ to return bytes or str is hideous, because Paths are clearly on the application side of the boundary.
Note that bytes() may not have the serious problem that str() does of being too catholic about its argument: nothing in __builtins__ has a __bytes__! Of course there are a few things that do work: ints, and sequences of ints.
Good point. But this only applies to when the user _explicitly_ deals with bytes. But when the user just deals with the type (str or bytes) that is passed in, as os.path.* as well as DirEntry now do, this does not work. -Koos

On Tue, Apr 12, 2016 at 11:56 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
One possible way to address this concern would be to have the underlying protocol be bytes/str (since boundary code frequently needs to handle the paths-are-bytes assumption in POSIX), but offer an "os.fspathname" API that rejected bytes output from os.fspath. That is, it would be equivalent to:
    def fspathname(path):
        name = os.fspath(path)
        if not isinstance(name, str):
            raise TypeError("Expected str for pathname, not {}".format(type(name)))
        return name
That way folks that wanted the clean "must be str" signature could use os.fspathname, while those that wanted to accept either could use the lower level os.fspath.
I'm not necessarily opposed to this. I kept bringing up bytes in the discussion because os.path.* etc. and DirEntry support bytes and will need to keep doing so for backwards compatibility. I have no intention to use bytes pathnames myself. But it may break existing code if functions, for instance, began to decode bytes paths to str if they did not previously do so (or to reject them). It is indeed a lot safer to make new code not support bytes paths than to change the behavior of old code. But then again, do we really recommend new code to use os.fspath (or os.fspathname)? Should they not be using either pathlib or os.path.* etc. so they don't have to care? I'm sure Ethan and his library (or some other path library) will manage without the function in the stdlib, as long as the dunder attribute is there. So I'm, once again, posing this question (that I don't think got any reactions previously): Is there a significant audience for this new function, or is it enough to keep it a private function for the stdlib to use? That handful of third-party path libraries can decide for themselves if they want to (a) reject bytes or (b) implicitly fsdecode them or (c) pass them through just like str, depending on whatever their case requires in terms of backwards compatibility or other goals. If we forget about the os.fswhatever function, we only have to decide whether the magic dunder attribute can be str or bytes or just str. -Koos

On 04/12/2016 09:26 AM, Koos Zevenhoven wrote:
So I'm, once again, posing this question (that I don't think got any reactions previously): Is there a significant audience for this new function, or is it enough to keep it a private function for the stdlib to use?
Quite frankly, I expect the stdlib itself to be the primary consumer. But I see no reason to not publish the function so that users who need the advanced functionality have easy access to it. -- ~Ethan~

On 12 April 2016 at 07:58, Ethan Furman <ethan@stoneleaf.us> wrote:
Sticking points: ---------------
Do we allow bytes to be returned from os.fspath()? If yes, then do we allow bytes from __fspath__()?
I've come around to the point of view that allowing both str and bytes-like objects to pass through unchanged makes sense, with the rationale being the one someone mentioned regarding ease-of-use in os.path. Consider os.path.join: with a permissive os.fspath, the necessary update should just be to introduce "map(os.fspath, args)" (or its C equivalent), and then continue with the existing bytes vs str handling logic. Functions consuming os.fspath can then decide on a case-by-case basis how they want to handle binary paths: either use them as is (which will usually work on mostly-ASCII systems), convert them to text with os.fsdecode (which will usually work on *nix systems), or disallow them entirely (which would probably only be appropriate for libraries that wanted to ensure support for non-ASCII paths on Windows systems). That then cascades into the other open questions mentioned: - permitted return types for both fspath and __fspath__ would be (str, bytes) - the names would be fspath and __fspath__, since the result may be either a path name as text, or an encoded path name as bytes Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
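The "map(os.fspath, args)" update described above might look roughly like the sketch below. The wrapper name join is hypothetical, posixpath is used only so the result does not depend on the platform, and this assumes a Python version that already ships os.fspath; it is not the actual stdlib change.

```python
import os
import pathlib
import posixpath

def join(first, *rest):
    """Hypothetical wrapper: normalise every argument with os.fspath,
    then hand over to the existing bytes-vs-str handling logic."""
    first, *rest = map(os.fspath, (first, *rest))
    return posixpath.join(first, *rest)

# Path objects and str mix freely once the path object is unwrapped.
text_result = join(pathlib.PurePosixPath("/usr"), "lib", "python3")

# An all-bytes call keeps working and stays bytes.
bytes_result = join(b"/usr", b"lib")
```

Mixing str and bytes in one call is still rejected with a TypeError by the existing join logic, so the permissive fspath does not relax that check.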

On 12 April 2016 at 13:45, Nick Coghlan <ncoghlan@gmail.com> wrote:
Consider os.path.join: with a permissive os.fspath, the necessary update should just be to introduce "map(os.fspath, args)" (or its C equivalent), and then continue with the existing bytes vs str handling logic.
That does remind me: once a patch is available, we should check the benchmark numbers with the patch applied. I'd expect the new protocol overhead to be swamped by the actual IO costs, but this kind of low level change can have surprising consequences. Regarding the type checks, PyObject_AsFilesystemPath (or whatever we call it) will be implemented in C, with os.fspath just calling that, so doing "PyUnicode_Check(path) || PyBytes_Check(path)" on the result will be both cheap and convenient for API consumers (since it means they know they only have to cope with bytes or str instances internally, and will get a clear error message if handed something else). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
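As a rough pure-Python picture of the check described above (the real thing would live in C), a permissive fspath could be sketched like this. The function body and the Demo class are illustrative assumptions, not the implementation under discussion.

```python
def fspath(path):
    """Sketch of a permissive fspath: return str or bytes, reject the rest."""
    if isinstance(path, (str, bytes)):
        return path
    try:
        # Look the method up on the type, as is usual for dunder protocols.
        method = type(path).__fspath__
    except AttributeError:
        raise TypeError("expected str, bytes or path-like object, not "
                        + type(path).__name__) from None
    result = method(path)
    # Python-level equivalent of PyUnicode_Check(...) || PyBytes_Check(...).
    if not isinstance(result, (str, bytes)):
        raise TypeError("__fspath__ returned " + type(result).__name__
                        + ", expected str or bytes")
    return result

class Demo:
    """Hypothetical path-like object that returns whatever it was given."""
    def __init__(self, value):
        self.value = value

    def __fspath__(self):
        return self.value
```

Consumers of such a function only ever see str or bytes; anything else fails early with a clear TypeError.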

with the rationale being the one someone mentioned regarding ease-of-use in os.path.
Consider os.path.join:
Why in the world do the os.path functions need to work with Path objects? (and other conforming objects) This all started with the goal of using Path objects in the stdlib, but that's for opening files, etc. Path is an alternative to os.path -- you don't need to use both. And if you do have a byte path, you can stick with os.path.... BTW, I'm confused about what a bytes path IS -- is it encoded? Can you assume it can be decoded? It seems to me that the ONLY time you should get a byte path is from a low level system call on a posix system, and you may have no idea how it's encoded. So the ONLY thing you should do with it is pass it along to another low level system call. I can't see why we should support anything else with bytes objects.
- the names would be fspath and __fspath__, since the result may be either a path name as text, or an encoded path name as bytes
You just used the phrase "path name as bytes" -- so why is __pathname__ inappropriate if it might return bytes? I like __pathname__ better because this entire effort is because we've decided it's important to make the distinction between a "path" and the text representation of said path. Just sayin' -CHB

Chris Barker - NOAA Federal wrote:
Why in the world do the os.path functions need to work with Path objects?
So that applications using path objects can pass them to library code that uses os.path to manipulate them.
I'm confused about what a bytes path IS -- is it encoded?
It's a sequence of bytes identifying a file. Often it will be an encoding of some piece of text in the file system encoding, but there's no guarantee of that.
Can you assume it can be decoded ?
Only if you use an encoding in which all byte sequences are valid, such as latin1 or utf8+surrogateescape.
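A quick way to see the difference between these codecs is to decode every possible byte value. This is just an illustration using the standard codecs; the variable names are invented.

```python
every_byte = bytes(range(256))

# latin-1 maps each byte to exactly one code point, so it never fails.
as_latin1 = every_byte.decode("latin-1")

# Strict UTF-8 rejects byte sequences that are not well-formed.
try:
    every_byte.decode("utf-8")
    strict_utf8_ok = True
except UnicodeDecodeError:
    strict_utf8_ok = False

# With surrogateescape (PEP 383) undecodable bytes become lone surrogates,
# and encoding with the same handler restores the original bytes.
as_escaped = every_byte.decode("utf-8", errors="surrogateescape")
restored = as_escaped.encode("utf-8", errors="surrogateescape")
```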
So the ONLY thing you should do with it is pass it along to another low level system call.
Not quite -- you can separate it into components and work with them. Essentially the same set of operations that os.path provides.
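To make "separate it into components" concrete, here is a small sketch using posixpath on a bytes pathname. The path itself is made up, and posixpath is chosen only so the result is the same on every platform.

```python
import posixpath

raw = b"/srv/data/report.tar.gz"

# The usual os.path-style operations work on bytes and return bytes.
directory, filename = posixpath.split(raw)
stem, extension = posixpath.splitext(filename)

# Splitting on the separator byte needs no knowledge of the encoding.
parts = raw.split(b"/")
```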
- the names would be fspath and __fspath__, since the result may be either a path name as text, or an encoded path name as bytes
I like __pathname__ better because this entire effort is because we've decided it's important to make the distinction between a "path" and the text representation of said path.
I agree -- the term "pathname" can cover both text and bytes. When posix talks about pathnames it's really talking about bytes. -- Greg

On Mon, Apr 11, 2016 at 10:40 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
So the ONLY thing
you should do with it is pass it along to another low level system call.
Not quite -- you can separate it into components and work with them. Essentially the same set of operations that os.path provides.
ahh yes, so while posix claims that paths are "just a char*", they are really bytes where we can assume that the byte with value 2F is the pathsep (and that 2E separates an extension?), so I suppose os.path is useful. But I still think that most of us should never deal with bytes paths, and the few that need to should just work with the low level functions and be done with it.

One more thought came up just now: there are different levels of abstractions and representations for paths. We don't want to make Path a subclass of string, because Path is supposed to be a higher level abstraction -- good.

Then at the bottom of the stack, we NEED the bytes level path, because that's what ultimately gets passed to the OS.

The legacy from the single-byte encoding days is that bytes and strings were the same, so we could let people work with nice human readable strings, while also working with byte paths in the same way -- but those days are gone -- py3 makes a clear (and important) distinction between nice human readable strings and the bytes that represent them.

So: why use strings as the lingua franca of paths? i.e. the basis of the path protocol. Maybe we should support only two path representations:

1) A "proper" path object -- i.e. pathlib.Path or anything else that supports the path protocol.
2) the bytes that the OS actually needs.

This would mean that the protocol would be to have a __pathbytes__() method that would return the bytes that should be passed off to the OS. A posix Path implementation could store that internal bytes representation, so it could pass it off unchanged if that's all you need to do. Any current API that takes bytes could be made to easily work.

I'm SURE I'm missing something really big here, but it seems like maybe it's better to get farther from "strings as paths" rather than closer to it....

-CHB

--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov

On Tue, Apr 12, 2016 at 7:19 PM, Chris Barker <chris.barker@noaa.gov> wrote:
One more thought came up just now: there are different levels of abstractions and representations for paths. We don't want to make Path a subclass of string, because Path is supposed to be a higher level abstraction -- good.
Then at the bottom of the stack, we NEED the bytes level path, because that's what ultimately gets passed to the OS.
The legacy from the single-byte encoding days is that bytes and strings were the same, so we could let people work with nice human readable strings, while also working with byte paths in the same way -- but those days are gone -- py3 makes a clear (and important) distinction between nice human readable strings and the bytes that represent them.
So: why use strings as the lingua franca of paths? i.e. the basis of the path protocol. maybe we should support only two path representations:
1) A "proper" path object -- i.e. pathlib.Path or anything else that supports the path protocol.
2) the bytes that the OS actually needs.
You do have a point there. But since bytes pathnames are deprecated on windows, this seems to lead to supporting both str and bytes in the protocol, or having two protocols __fspathbytes__ and __fspathstr__ (and one being preferred over the other, potentially even depending on the platform). -Koos

On Tue, Apr 12, 2016 at 9:32 AM, Koos Zevenhoven <k7hoven@gmail.com> wrote:
1) A "proper" path object -- i.e. pathlib.Path or anything else that supports the path protocol.
2) the bytes that the OS actually needs.
You do have a point there. But since bytes pathnames are deprecated on windows,
Ah -- there's the fatal flaw -- even Windows needs bytes at the lowest level, but the decision was already made there to use str as the lingua-franca -- i.e. the user NEVER sees a path as a bytestring on Windows? I guess that's decided then. str is the exchange format. -CHB

On Tue, Apr 12, 2016, at 12:40, Chris Barker wrote:
Ah -- there's the fatal flaw -- even Windows needs bytes at the lowest level,
Only in the sense that literally everything's bytes at the lowest level. But the bytes Windows needs are not in an ASCII-compatible encoding so it's not reasonable to talk about them in the same way as every other kind of bytes filename.
but the decision was already made there to use str as the lingua-franca -- i.e. the user NEVER sees a path as a bytestring on Windows? I guess that's decided then. str is the exchange format.

On 13 April 2016 at 02:19, Chris Barker <chris.barker@noaa.gov> wrote:
So: why use strings as the lingua franca of paths? i.e. the basis of the path protocol. maybe we should support only two path representations:
1) A "proper" path object -- i.e. pathlib.Path or anything else that supports the path protocol.
2) the bytes that the OS actually needs.
this would mean that the protocol would be to have a __pathbytes__() method that would return the bytes that should be passed off to the OS.
The reason to favour strings over raw bytes for path manipulation is the same reason to favour them anywhere else: to avoid having to worry about encodings *while* you're manipulating things, and instead only worry about the encoding when actually talking to the OS (which may be UTF-16-LE to talk to a Windows API, or UTF-8 to talk to a *nix API, or something else entirely if your OS is set up that way, or you're writing the path to a file or network packet, rather than using it locally). Regardless of what we decide about os.fspath's return type, that general principle won't change - if you're manipulating bytes paths directly, you're doing something relatively specialised (like working on CPython's own os module). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 04/11/2016 10:14 PM, Chris Barker - NOAA Federal wrote:
Consider os.path.join:
Why in the world do the os.path functions need to work with Path objects? ( and other conforming objects)
Because library XYZ that takes a path and wants to open it shouldn't have to care whether that path is a string or pathlib.Path -- but if os.open can't use pathlib.Path then the library has to care (or the user has to care).
This all started with the goal of using Path objects in the stdlib, but that's for opening files, etc.
Etc. as in os.join? os.stat? os.path.split?
Path is an alternative to os.path -- you don't need to use both.
As a user you don't, no. As a library that has no control over what kind of "path" is passed to you -- well, if os and os.path can accept Path objects then you can just use os and os.path; otherwise you have to use os and os.path if passed a str or bytes, and pathlib.Path if passed a pathlib.Path -- so you do have to use both.
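The library-author situation described here can be sketched as follows, assuming a Python version that provides os.fspath. The function name read_first_line and the file contents are invented for the example; the point is only that the library normalises once and never has to ask which kind of "path" it was handed.

```python
import os
import pathlib
import tempfile

def read_first_line(path):
    """Hypothetical library function: accepts str, bytes, or a path object."""
    path = os.fspath(path)  # normalise once, at the boundary
    with open(path, "r", encoding="utf-8") as handle:
        return handle.readline().rstrip("\n")

with tempfile.TemporaryDirectory() as tmp:
    target = pathlib.Path(tmp) / "note.txt"
    target.write_text("hello\nworld\n", encoding="utf-8")
    results = [
        read_first_line(str(target)),          # plain str
        read_first_line(target),               # pathlib.Path
        read_first_line(os.fsencode(target)),  # bytes
    ]
```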
- the names would be fspath and __fspath__, since the result may be either a path name as text, or an encoded path name as bytes
You just used the phrase "path name as bytes" -- so why is __pathname__ inappropriate if it might return bytes?
No, he used the phrase "*encoded* path name as bytes". Names are typically represented as text, and since bytes might be returned we don't want a signal that says text.
I like __pathname__ better because this entire effort is because we've decided it's important to make the distinction between a "path" and the text representation of said path.
No, this entire effort is to make pathlib work with the rest of the stdlib. -- ~Ethan~

Sorry for disturbing this thread's harmony. On 12.04.2016 08:00, Ethan Furman wrote:
On 04/11/2016 10:14 PM, Chris Barker - NOAA Federal wrote:
Consider os.path.join:
Why in the world do the os.path functions need to work with Path objects? ( and other conforming objects)
Because library XYZ that takes a path and wants to open it shouldn't have to care whether that path is a string or pathlib.Path -- but if os.open can't use pathlib.Path then the library has to care (or the user has to care).
This all started with the goal of using Path objects in the stdlib, but that's for opening files, etc.
Etc. as in os.join? os.stat? os.path.split?
Path is an alternative to os.path -- you don't need to use both.
I agree with that quote of Chris.
As a user you don't, no. As a library that has no control over what kind of "path" is passed to you -- well, if os and os.path can accept Path objects then you can just use os and os.path; otherwise you have to use os and os.path if passed a str or bytes, and pathlib.Path if passed a pathlib.Path -- so you do have to use both.
I don't agree here. There's no need to increase the convenience for a library maintainer when it comes to implicit conversions. When people want to use your library and it requires a string, they can simply use "my_path.path" and everything still works for them when they switch to pathlib. Best, Sven

The following is my opinion, as will become obvious, but it's based on over a decade of observing these lists, and other open source development lists. In a context where some core developers have unsubscribed from these lists, and others regularly report muting threads with a certain air of asperity, I think it's worth the risk of seeming arrogant to explain some of the customs (which are complex and subtle) around posting to Python developer lists. I'm posting publicly because there are several new developers whose activity and fresh perspective is very welcome, but harmony *is* being disturbed, IMO unnecessarily. This particular post caught my eye, but it's only an example of one of the most unharmonious posting styles that has become common recently. Attribution deliberately removed.
Sorry for disturbing this thread's harmony.
*sigh* There is way too much of this on Python-Ideas recently, and there shouldn't be any on Python-Dev. So please don't. Specifically, disagreement with an apparently developing consensus is fine but please avoid this:
Path is an alternative to os.path -- you don't need to use both.
I agree with that quote of Chris.
It's a waste of time to post *what* you agree with.[1] Decisions are not taken by vote in this community, except for the color of the bikeshed, where it is agreed that *what* decision is taken doesn't matter, but that some decision should be taken expeditiously.[2] Chris already stated this position clearly and it's not a "color", so there is no need to reiterate. It simply wastes others' time to read it. (Whether it was a waste of the poster's time is not for me to comment on.)
What matters to the decision is *why* you agree (or disagree). If you think that some of Chris's arguments are bogus (and should be disregarded) and others are important, that is valuable information. It's even better if you can shed additional light on the matter (example below).
Also, expression of agreement is often a prelude to a request for information. "I agree with Z's post. At least, I have never needed X. *When* do you need X? Let's look for a better way than X!"
Unsupported (dis)agreement to statements about "needs" also may be taken as *rude*, because others may infer your arrogant claim to know what *they* do or don't need. Admittedly there's a difficult distinction here between Chris's *idiom* where "you don't need to" translates to "In my understanding, it is generally not necessary to", and your *unsupported* agreement, which in my dialect of English changes the emphasis to imply you know better than those who disagree with you and Chris. And, of course, the position that others are "too easily offended" is often reasonable, but you should be aware that there will be an impact on your reputation and ability to influence development of Python (even if it doesn't come near the point where a moderator invokes "Code of Conduct").
"Me too" posts aren't entirely forbidden, but I feel that in Python custom they are most appropriate when voting on bikeshed colors, and as applause for a *technically* excellent suggestion. They should be avoided in the context of value judgments (of "need" and "simplicity", for example) for the reason given above.
When people want to use your library and it requires a string, they can simply use "my_path.path" and everything still works for them when they switch to pathlib.
This is disrespectful in tone. I don't know if you're responding to Ethan here, but he's one of the authors in question. We *know* that Ethan doesn't like such inelegant idioms -- he said so -- where "this object has an appropriate conversion to your argument type, so you should apply it implicitly" is unambiguous.[3] So for him, it's *not* so simple. Since it's not a matter of voting, each proponent should provide more contexts where preferred programming idioms are "Pythonic" to sway the sense of the community, or if necessary, the BDFL.
Where that aesthetic came up was in the context of consistently wrapping arguments that might be Paths in str, as in
    p = Path(*stuff) or defaultstring
    # 500 lines crossing function and module boundaries!
    with open(str(p)) as f:
        process(f)
I think it was Nick who posted agreement with Ethan on the aesthetics of str-wrapping. If that were all, he probably wouldn't have posted (see fn. 1), but he further pointed out that this application of str is *dangerous* because *everything* in Python can be coerced to str. That was a very valuable observation, which swayed the list in favor of "Uh-oh, we can't recommend 'os.method(str(Path))'!"
This is my last post on this particular topic, but I will be happy to discuss off-list. (I may discuss further in public on my blog, but first I have to get a blog. :-)
Footnotes:
[1] "You" is generic here. There are a couple of developers whose agreement has the status of pronouncement of Pythonicity. Aspire to that, but don't assume it -- very few have it, and it's actually *very* rarely exercised. And you can recognize them because they are *asked* to pronounce -- by people whose statements you thought were already authoritative!
[2] And even so votes are often overturned by later arguments, both theoretical and based in experience. See for example the several threads over time on the naming of Py_XSETREF.
[3] Interpreting Zen koans frequently requires figure-ground inversion. In this case we can apply "In the face of ambiguity, refuse to guess" in the form "in the absence of ambiguity, don't wait to be asked". I'm hardly authoritative, but FWIW :-) I think Ethan's esthetic sense here accords with Pythonicity.
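To make the str-wrapping hazard mentioned in the message above concrete, here is a minimal sketch. The helper name wrap_with_str is hypothetical and only mimics the "os.method(str(p))" idiom; the sketch relies on nothing beyond the fact that str() accepts any object.

```python
from pathlib import PurePosixPath


def wrap_with_str(maybe_path):
    """Hypothetical helper mimicking the os.method(str(p)) idiom."""
    # str() never refuses an argument, so no type check happens here.
    return str(maybe_path)


# A real path object converts the way the caller intends.
good = wrap_with_str(PurePosixPath("etc") / "hosts")

# Values that are not paths at all are converted just as silently.
bad_none = wrap_with_str(None)
bad_list = wrap_with_str(["etc", "hosts"])

print(good)      # etc/hosts
print(bad_none)  # None
print(bad_list)  # ['etc', 'hosts']
```

The two "bad" results are ordinary strings, so a later open() call would either fail far from the real mistake or quietly touch a file literally named "None".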

On Wed, Apr 13, 2016 at 5:56 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
The following is my opinion, as will become obvious, but it's based on over a decade of observing these lists, and other open source development lists. In a context where some core developers have unsubscribed from these lists, and others regularly report muting threads with a certain air of asperity, I think it's worth the risk of seeming arrogant to explain some of the customs (which are complex and subtle) around posting to Python developer lists. I'm posting publicly because there are several new developers whose activity and fresh perspective is very welcome, but harmony *is* being disturbed, IMO unnecessarily.
Thank you for this thoughtful post. While none of the quotes you refer to are mine, I did try to find whether any of the advice is something I should learn from. While I didn't find a whole lot (please do correct me if you think otherwise), it is also valuable to hear these things from someone more experienced, even just to confirm what I may have thought or guessed. I can't really tell, but possibly some of the thoughts are interesting even to people significantly more experienced than me. I know you are not interested in discussing this further here, but I'll add some inexperienced points of view inline below, just in case someone is interested:
This particular post caught my eye, but it's only an example of one of the most unharmonious posting styles that has become common recently. Attribution deliberately removed.
Sorry for disturbing this thread's harmony.
*sigh* There is way too much of this on Python-Ideas recently, and there shouldn't be any on Python-Dev. So please don't. Specifically, disagreement with an apparently developing consensus is fine but please avoid this:
Path is an alternative to os.path -- you don't need to use both.
I agree with that quote of Chris.
It's a waste of time to post *what* you agree with.[1] Decisions are not taken by vote in this community, except for the color of the bikeshed, where it is agreed that *what* decision is taken doesn't matter, but that some decision should be taken expeditiously.[2] Chris already stated this position clearly and it's not a "color", so there is no need to reiterate. It simply wastes others' time to read it. (Whether it was a waste of the poster's time is not for me to comment on.)
What matters to the decision is *why* you agree (or disagree). If you think that some of Chris's arguments are bogus (and should be disregarded) and others are important, that is valuable information. It's even better if you can shed additional light on the matter (example below).
Also, expression of agreement is often a prelude to a request for information. "I agree with Z's post. At least, I have never needed X. *When* do you need X? Let's look for a better way than X!"
That's what I thought too. I remember several times recently when I mentioned that I agreed about something, then continued to add more to it, or even said I disagreed about something else. Part of the reason to also state that I agree is an attempt to keep the overall tone more positive. After all, the other person might be a highly experienced core developer who just did not happen to have gone through all the same thoughts regarding that specific question recently. I hope that has not been interpreted as arrogance such as "I know better than these people". For me, as one of the (many?) newcomers, especially on -dev, it can sometimes be difficult to tell whether not getting a reaction means "Good point, I agree", "I did not understand so I'll just ignore it", "I don't want to argue with you" or something else. Then again, someone just saying essentially the same thing without a reference a few posts later just feels strange. Also, if the only thing people apparently do is disagree about things, it makes the overall tone of the discussions at least *seem* very negative. From this point of view there seems to be some good in positive comments.
Unsupported (dis)agreement to statements about "needs" also may be taken as *rude*, because others may infer your arrogant claim to know what *they* do or don't need. Admittedly there's a difficult distinction here between Chris's *idiom* where "you don't need to" translates to "In my understanding, it is generally not necessary to", and your *unsupported* agreement, which in my dialect of English changes the emphasis to imply you know better than those who disagree with you and Chris. And, of course, the position that others are "too easily offended" is often reasonable, but you should be aware that there will be an impact on your reputation and ability to influence development of Python (even if it doesn't come near the point where a moderator invokes "Code of Conduct").
"Me too" posts aren't entirely forbidden, but I feel that in Python custom they are most appropriate when voting on bikeshed colors, and as applause for a *technically* excellent suggestion. They should be avoided in the context of value judgments (of "need" and "simplicity", for example) for the reason given above.
Personally, I've sometimes felt the urge to give a positive comment just to make sure something gets noticed, or to help the discussion *not* go around in circles by pointing out the important points more clearly to the people not as involved in the topic of discussion. But I've tried to resist this urge when I don't have anything to add. I find the notion of S/N (signal-to-noise ratio), which you in fact brought up recently in another thread, very important.
When people want to use your library and it requires a string, they can simply use "my_path.path" and everything still works for them when they switch to pathlib.
This is disrespectful in tone. I don't know if you're responding to Ethan here, but he's one of the authors in question. We *know* that Ethan doesn't like such inelegant idioms -- he said so -- where "this object has an appropriate conversion to your argument type, so you should apply it implicitly" is unambiguous.[3] So for him, it's *not* so simple. Since it's not a matter of voting, each proponent should provide more contexts where preferred programming idioms are "Pythonic" to sway the sense of the community, or if necessary, the BDFL.
Where that aesthetic came up was in the context of consistently wrapping arguments that might be Paths in str, as in
    p = Path(*stuff) or defaultstring
    # 500 lines crossing function and module boundaries!
    with open(str(p)) as f:
        process(f)
I think it was Nick who posted agreement with Ethan on the aesthetics of str-wrapping. If that were all, he probably wouldn't have posted (see fn. 1), but he further pointed out that this application of str is *dangerous* because *everything* in Python can be coerced to str. That was a very valuable observation, which swayed the list in favor of "Uh-oh, we can't recommend 'os.method(str(Path))'!"
This is my last post on this particular topic, but I will be happy to discuss off-list. (I may discuss further in public on my blog, but first I have to get a blog. :-)
Footnotes: [1] "You" is generic here. There are a couple of developers whose agreement has the status of pronouncement of Pythonicity. Aspire to that, but don't assume it -- very few have it, and it's actually *very* rarely exercised. And you can recognize them because they are *asked* to pronounce -- by people whose statements you thought were already authoritative!
[2] And even so votes are often overturned by later arguments, both theoretical and based in experience. See for example the several threads over time on the naming of Py_XSETREF.
[3] Interpreting Zen koans frequently requires figure-ground inversion. In this case we can apply "In the face of ambiguity, refuse to guess" in the form "in the absence of ambiguity, don't wait to be asked". I'm hardly authoritative, but FWIW :-) I think Ethan's esthetic sense here accords with Pythonicity.

On Tue, Apr 12, 2016 at 7:58 AM, Ethan Furman <ethan@stoneleaf.us> wrote:
Sticking points: ---------------
Do we allow bytes to be returned from os.fspath()? If yes, then do we allow bytes from __fspath__()?
I would say No and No, on the basis that it's *far* easier to widen their scope in 3.7 than to narrow it. Once you declare that one or both of these may return bytes, it becomes an annoying incompatibility to change that (even if it *is* marked provisional), which almost certainly means it won't happen. By restricting them both, we force the issue: if you want bytes, you'll know about it. I'd also prefer to stick to Unicode path names, for reasons I've stated in other threads. Undecodable path byte streams can be handled already, so what are we really gaining by allowing a Path-like object to emit bytes? If it becomes a major issue for a lot of types, it wouldn't be hard to add a helper function somewhere (or a mixin class that provides a ready-to-go __fspath__, which might well be sufficient). ChrisA

On 04/11/2016 02:58 PM, Ethan Furman wrote:
Sticking points: ---------------
Do we allow bytes to be returned from os.fspath()? If yes, then do we allow bytes from __fspath__()?
On 04/11/2016 10:28 PM, Stephen J. Turnbull wrote:
In text applications, "bytes as carcinogen" is an apt metaphor.
On 04/12/2016 08:25 AM, Chris Angelico wrote:
I would say No and No, on the basis that it's *far* easier to widen their scope in 3.7 than to narrow it.
On 04/11/2016 08:45 PM, Nick Coghlan wrote:
I've come around to the point of view that allowing both str and bytes-like objects to pass through unchanged makes sense, with the rationale being the one someone mentioned regarding ease-of-use in os.path. [...] One possible way to address this concern would be to have the underlying protocol be bytes/str (since boundary code frequently needs to handle the paths-are-bytes assumption in POSIX), but offer an "os.fspathname" API that rejected bytes output from os.fspath.
I think this is the way forward: offer a standard way to get paths-as-strings, with an easily supported way of working with paths-as-bytes. This could be with an os.fspathname() & os.fspath() pair of functions, or with a single function that has a parameter specifying what to do with bytes objects: reject (default), accept, or (maybe) an encoding to use to coerce to bytes. -- ~Ethan~
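The single-function variant described above might look roughly like the following. This is only a sketch under stated assumptions: the names fspath_sketch and fspathname_sketch and the allow_bytes keyword are illustrative (the keyword echoes the "allow_bytes approach" mentioned elsewhere in the thread), and nothing here is meant to represent what os.fspath() finally became.

```python
def fspath_sketch(path, *, allow_bytes=False):
    """Hypothetical: return a str path, or bytes only when allow_bytes is true."""
    # Look the method up first, so that only a *missing* __fspath__ is
    # tolerated; errors raised inside __fspath__ itself still propagate.
    try:
        method = type(path).__fspath__
    except AttributeError:
        result = path              # plain str/bytes pass through unchanged
    else:
        result = method(path)

    if isinstance(result, str):
        return result
    if isinstance(result, bytes):
        if allow_bytes:
            return result
        raise TypeError("bytes path rejected; pass allow_bytes=True to accept it")
    raise TypeError(f"expected a str or bytes path, got {type(result).__name__}")


def fspathname_sketch(path):
    """Hypothetical str-only twin for the two-function variant."""
    return fspath_sketch(path, allow_bytes=False)


class StrPath:
    """Toy path-like object used only for this illustration."""

    def __init__(self, name):
        self._name = name

    def __fspath__(self):
        return self._name


print(fspath_sketch(StrPath("spam/eggs")))          # spam/eggs
print(fspath_sketch(b"spam/eggs", allow_bytes=True))  # b'spam/eggs'
```

With this shape, code that never opts in cannot receive bytes by accident, while boundary code that truly needs bytes can ask for them explicitly.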

Ethan Furman <ethan <at> stoneleaf.us> writes:
Do we allow bytes to be returned from os.fspath()? If yes, then do we allow bytes from __fspath__()?
De-lurking. Especially since the ultimate goal is better interoperability, I feel like an implementation that people can play with would help guide the few remaining decisions. To help test the various options you could temporarily add a _allow_bytes=GLOBAL_CONFIG_OPTION default argument to both pathlib.__fspath__() and os.fspath(), with distinct configurable defaults for each. In the spirit of Python 3 I feel like bytes might not be needed in practice, but something like this with defaults of False will allow people to easily test all the various options.

On Tue, 12 Apr 2016 at 22:38 Michael Mysinger via Python-Dev < python-dev@python.org> wrote:
Ethan Furman <ethan <at> stoneleaf.us> writes:
Do we allow bytes to be returned from os.fspath()? If yes, then do we allow bytes from __fspath__()?
De-lurking. Especially since the ultimate goal is better interoperability, I feel like an implementation that people can play with would help guide the few remaining decisions. To help test the various options you could temporarily add a _allow_bytes=GLOBAL_CONFIG_OPTION default argument to both pathlib.__fspath__() and os.fspath(), with distinct configurable defaults for each.
In the spirit of Python 3 I feel like bytes might not be needed in practice, but something like this with defaults of False will allow people to easily test all the various options.
https://gist.github.com/brettcannon/b3719f54715787d54a206bc011869aa1 has the four potential approaches implemented (although it doesn't follow the "separate functions" approach some are proposing and instead goes with the allow_bytes approach I originally proposed).

On 4/13/2016 13:10, Brett Cannon wrote:
https://gist.github.com/brettcannon/b3719f54715787d54a206bc011869aa1 has the four potential approaches implemented (although it doesn't follow the "separate functions" approach some are proposing and instead goes with the allow_bytes approach I originally proposed).
Number 4 is my personal favorite - it has a simple control flow path and is the least needlessly restrictive. (I could rant about needless restrictions, but I am about a decade late for that, so I won't bother.)

On 04/13/2016 10:22 AM, Alexander Walters wrote:
On 4/13/2016 13:10, Brett Cannon wrote:
https://gist.github.com/brettcannon/b3719f54715787d54a206bc011869aa1 has the four potential approaches implemented (although it doesn't follow the "separate functions" approach some are proposing and instead goes with the allow_bytes approach I originally proposed).
Number 4 is my personal favorite - it has a simple control flow path and is the least needlessly restrictive.
Number 3: it allows bytes, but only when told it's okay to do so. Having code get a bytes object when one is not expected is not a headache we need to inflict on anyone. -- ~Ethan~

On 4/13/2016 13:49, Ethan Furman wrote:
Number 3: it allows bytes, but only when told it's okay to do so. Having code get a bytes object when one is not expected is not a headache we need to inflict on anyone.
This is an artifact of the other needless restrictions I said I wouldn't rant about. I think it is in the best interest not to perpetuate those needless restrictions.

Brett Cannon <brett <at> python.org> writes:
In the spirit of Python 3 I feel like bytes might not be needed in practice, but something like this with defaults of False will allow people to easily test all the various options.
https://gist.github.com/brettcannon/b3719f54715787d54a206bc011869aa1 has the four potential approaches implemented (although it doesn't follow the "separate functions" approach some are proposing and instead goes with the allow_bytes approach I originally proposed).
Either number 1 or number 3 for me (I don't think bytes path-like objects are useful in Python).
Regards
Antoine.

On Thu, Apr 14, 2016 at 3:10 AM, Brett Cannon <brett@python.org> wrote:
https://gist.github.com/brettcannon/b3719f54715787d54a206bc011869aa1 has the four potential approaches implemented (although it doesn't follow the "separate functions" approach some are proposing and instead goes with the allow_bytes approach I originally proposed).
All of them have this construct:
    try:
        path = path.__fspath__()
    except AttributeError:
        pass
Is that the intention, or should the exception catching be narrower? I know it's clunky to write it in Python, but AIUI it's less so in C:
    try:
        callme = path.__fspath__
    except AttributeError:
        pass
    else:
        path = callme()
ChrisA

On Wed, 13 Apr 2016 at 12:25 Chris Angelico <rosuav@gmail.com> wrote:
On Thu, Apr 14, 2016 at 3:10 AM, Brett Cannon <brett@python.org> wrote:
https://gist.github.com/brettcannon/b3719f54715787d54a206bc011869aa1 has the four potential approaches implemented (although it doesn't follow the "separate functions" approach some are proposing and instead goes with the allow_bytes approach I originally proposed).
All of them have this construct:
    try:
        path = path.__fspath__()
    except AttributeError:
        pass
Is that the intention, or should the exception catching be narrower? I know it's clunky to write it in Python, but AIUI it's less so in C:
    try:
        callme = path.__fspath__
    except AttributeError:
        pass
    else:
        path = callme()
I'm assuming the C code will do what you're suggesting. My way is just faster to write in 2 minutes of coding. :)

On Thu, Apr 14, 2016 at 5:30 AM, Brett Cannon <brett@python.org> wrote:
On Wed, 13 Apr 2016 at 12:25 Chris Angelico <rosuav@gmail.com> wrote:
On Thu, Apr 14, 2016 at 3:10 AM, Brett Cannon <brett@python.org> wrote:
https://gist.github.com/brettcannon/b3719f54715787d54a206bc011869aa1 has the four potential approaches implemented (although it doesn't follow the "separate functions" approach some are proposing and instead goes with the allow_bytes approach I originally proposed).
All of them have this construct:
    try:
        path = path.__fspath__()
    except AttributeError:
        pass
Is that the intention, or should the exception catching be narrower? I know it's clunky to write it in Python, but AIUI it's less so in C:
    try:
        callme = path.__fspath__
    except AttributeError:
        pass
    else:
        path = callme()
I'm assuming the C code will do what you're suggesting. My way is just faster to write in 2 minutes of coding. :)
Cool cool. Just checking! You're already aware that my preference is for the first one, str-only. I don't think the second one has much value (a path-like object can only ever return a str, but a bytes can be passed through unchanged?), and the fourth strikes me as a bad idea (just allowing bytes any time). So my votes are +1, -0.5, +0, -1. ChrisA

On Wed, Apr 13, 2016 at 3:24 PM, Chris Angelico <rosuav@gmail.com> wrote:
Is that the intention, or should the exception catching be narrower? I know it's clunky to write it in Python, but AIUI it's less so in C:
    try:
        callme = path.__fspath__
    except AttributeError:
        pass
    else:
        path = callme()
+1 for this variant; I really don't like masking errors inside the __fspath__ implementation. -Fred -- Fred L. Drake, Jr. <fred at fdrake.net> "A storm broke loose in my mind." --Albert Einstein

On Wed, 13 Apr 2016 at 12:39 Fred Drake <fred@fdrake.net> wrote:
On Wed, Apr 13, 2016 at 3:24 PM, Chris Angelico <rosuav@gmail.com> wrote:
Is that the intention, or should the exception catching be narrower? I know it's clunky to write it in Python, but AIUI it's less so in C:
    try:
        callme = path.__fspath__
    except AttributeError:
        pass
    else:
        path = callme()
+1 for this variant; I really don't like masking errors inside the __fspath__ implementation.
Don't read too much into the code in that gist. I just did them quickly to get the point across of the proposals in terms of str/bytes, not what will be proposed in any final patch.

so are we worried that __fspath__ will exist and be callable, but might raise an AttributeError somewhere inside itself? if so isn't it broken anyway, so should it be ignored?
and I know it's asking permission rather than forgiveness, but what's wrong with:
    if hasattr(path, "__fspath__"):
        path = path.__fspath__()
if you really want to check for the existence of the attribute first?
or even:
    path = path.__fspath__() if hasattr(path, "__fspath__") else path
(OK, really a Pythonic style question now....)
-CHB
On Wed, Apr 13, 2016 at 12:54 PM, Brett Cannon <brett@python.org> wrote:
On Wed, 13 Apr 2016 at 12:39 Fred Drake <fred@fdrake.net> wrote:
On Wed, Apr 13, 2016 at 3:24 PM, Chris Angelico <rosuav@gmail.com> wrote:
Is that the intention, or should the exception catching be narrower? I know it's clunky to write it in Python, but AIUI it's less so in C:
    try:
        callme = path.__fspath__
    except AttributeError:
        pass
    else:
        path = callme()
+1 for this variant; I really don't like masking errors inside the __fspath__ implementation.
Don't read too much into the code in that gist. I just did them quickly to get the point across of the proposals in terms of str/bytes, not what will be proposed in any final patch.
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/chris.barker%40noaa.gov
-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

On Wed, 13 Apr 2016 at 13:40 Chris Barker <chris.barker@noaa.gov> wrote:
so are we worried that __fspath__ will exist and be callable, but might raise an AttributeError somewhere inside itself? if so isn't it broken anyway, so should it be ignored?
It should propagate instead of swallowing up the exception, otherwise it's hard to debug why __fspath__ seems to be ignored.
and I know it's asking permission rather than forgiveness, but what's wrong with:
if hasattr(path, "__fspath__"): path = path.__fspath__()
if you really want to check for the existence of the attribute first?
Nothing.
or even:
path = path.__fspath__() if hasattr(path, "__fspath__") else path
That also works.
(OK, really a Pythonic style question now....)
Yes, this is getting a bit side-tracked over some example code to just get a concept across. -Brett
-CHB
On Wed, Apr 13, 2016 at 12:54 PM, Brett Cannon <brett@python.org> wrote:
On Wed, 13 Apr 2016 at 12:39 Fred Drake <fred@fdrake.net> wrote:
On Wed, Apr 13, 2016 at 3:24 PM, Chris Angelico <rosuav@gmail.com> wrote:
Is that the intention, or should the exception catching be narrower? I know it's clunky to write it in Python, but AIUI it's less so in C:
    try:
        callme = path.__fspath__
    except AttributeError:
        pass
    else:
        path = callme()
+1 for this variant; I really don't like masking errors inside the __fspath__ implementation.
Don't read too much into the code in that gist. I just did them quickly to get the point across of the proposals in terms of str/bytes, not what will be proposed in any final patch.

On Wed, Apr 13, 2016, at 16:39, Chris Barker wrote:
so are we worried that __fspath__ will exist and be callable, but might raise an AttributeError somewhere inside itself? if so isn't it broken anyway, so should it be ignored?
Well, if you're going to say "ignore the protocol because it's broken", where do you stop? What if it raises some other exception? What if it raises SystemExit?

On Wed, Apr 13, 2016 at 1:47 PM, Random832 <random832@fastmail.com> wrote:
On Wed, Apr 13, 2016, at 16:39, Chris Barker wrote:
so are we worried that __fspath__ will exist and be callable, but might raise an AttributeError somewhere inside itself? if so isn't it broken anyway, so should it be ignored?
Well, if you're going to say "ignore the protocol because it's broken", where do you stop? What if it raises some other exception? What if it raises SystemExit?
this is pretty much always the case with EAFP coding:
    try:
        something()
    except SomeError:
        do_something_else()
unless SomeError is a custom defined error that you know is never going to get raised anywhere else, then something() could raise SomeError for the reason you expect, or some code deep in the call stack could raise SomeError also, and you wouldn't know that.
I had a student run into this and it took him a good while to debug it. But that was because the code in something() was pretty darn buggy. If he had tested something() by itself, there would have been no issue finding the problem.
In this case, I don't know that we need to be tolerant of buggy __fspathname__() implementations -- they should be tested outside these checks, and not be buggy. So a buggy implementation may raise and may be ignored, depending on what Exception the bug triggers -- big deal. The only time it would matter is when the implementer is debugging the implementation.
-CHB

On 04/13/2016 05:06 PM, Chris Barker wrote:
In this case, I don't know that we need to be tolerant of buggy __fspathname__() implementations -- they should be tested outside these checks, and not be buggy. So a buggy implementation may raise and may be ignored, depending on what Exception the bug triggers -- big deal. The only time it would matter is when the implementer is debugging the implementation.
Yet the idea behind robust exception handling is to test as little as possible and only catch what you know how to correct. This code catches only one thing, only at one place, and we know how to deal with it:
    try:
        fsp = obj.__fspath__
    except AttributeError:
        pass
    else:
        fsp = fsp()
Contrarily, this next code catches the same error, but it could happen at the one place we know how to deal with it *or* anywhere further down the call stack where we have no clue what the proper course is to handle the problem... yet we suppress it anyway:
    try:
        fsp = obj.__fspath__()
    except AttributeError:
        pass
Certainly not code I want to see in the stdlib.
-- ~Ethan~
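The difference between the two forms above can be observed directly. The following is a small sketch, not code from the gist: BuggyPath, broad() and narrow() are hypothetical names, and the "bug" is simply a reference to an attribute that does not exist.

```python
class BuggyPath:
    """Toy object whose __fspath__ exists but contains a bug."""

    def __fspath__(self):
        # The bug: this attribute was never set, so AttributeError is
        # raised from *inside* the method.
        return self.missing_attribute


def broad(obj):
    """Call inside the try block: also swallows errors from the method body."""
    try:
        return obj.__fspath__()
    except AttributeError:
        return obj


def narrow(obj):
    """Only the attribute lookup is guarded; the call happens outside."""
    try:
        method = obj.__fspath__
    except AttributeError:
        return obj
    else:
        return method()


buggy = BuggyPath()

# The broad form hides the bug and hands back the original object.
broad_result = broad(buggy)

# The narrow form lets the internal AttributeError surface.
try:
    narrow(buggy)
except AttributeError as exc:
    narrow_error = exc
else:
    narrow_error = None

print(broad_result is buggy)           # True
print(type(narrow_error).__name__)     # AttributeError
```

Both helpers behave identically for objects that simply lack __fspath__ (a plain str is returned unchanged), so the narrow form costs nothing in the normal case.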

On Thu, Apr 14, 2016 at 5:46 AM, Random832 <random832@fastmail.com> wrote:
On Wed, Apr 13, 2016, at 15:24, Chris Angelico wrote:
Is that the intention, or should the exception catching be narrower? I know it's clunky to write it in Python, but AIUI it's less so in C:
How is it less so in C? You lose the ability to PyObject_CallMethod.
I might be wrong, then. Wasn't sure how it was all implemented. Anyway, it's a correctness thing, not a simplicity one, so even if it is clunkier, it ought to be the case. And that is the intention, so we're fine. ChrisA

Oh, since others voted, I will also vote and explain my vote.
I like choice 1, str only, because it's very well defined. In Python 3, Unicode is simply the native type for text. It's accepted by almost all functions. In other emails, I also explained that Unicode is fine for storing undecodable filenames on UNIX; it has worked as expected for many years (since Python 3.3).
--
If you cannot survive without bytes, I suggest adding two functions: one for str only, another which can return str or bytes.
Maybe you in fact want two protocols: __fspath__ (str only) and __fspathb__ (bytes only)? os.fspathb() would first try __fspathb__, or fall back to os.fsencode(__fspath__). os.fspath() would first try __fspath__, or fall back to os.fsdecode(__fspathb__). IMHO it's not worth having such complexity while Unicode handles all use cases.
Or do you know of functions implemented in Python accepting str *and* bytes?
--
The C implementation of the os module has an important path_converter() function:
 * path_converter accepts (Unicode) strings and their
 * subclasses, and bytes and their subclasses.  What
 * it does with the argument depends on the platform:
 *
 *   * On Windows, if we get a (Unicode) string we
 *     extract the wchar_t * and return it; if we get
 *     bytes we extract the char * and return that.
 *
 *   * On all other platforms, strings are encoded
 *     to bytes using PyUnicode_FSConverter, then we
 *     extract the char * from the bytes object and
 *     return that.
This function will implement something like os.fspath().
With os.fspath() only accepting str, we will directly return the Unicode string on Windows. On UNIX, Unicode will be encoded, as is already done for Unicode strings.
This specific function would benefit from flavor 4 (os.fspath() can return str and bytes), but it's more an exception than the rule. It would be more of a micro-optimization than a good reason to drive the API design.
Victor
On Wednesday, April 13, 2016, Brett Cannon <brett@python.org> wrote:
https://gist.github.com/brettcannon/b3719f54715787d54a206bc011869aa1 has the four potential approaches implemented (although it doesn't follow the "separate functions" approach some are proposing and instead goes with the allow_bytes approach I originally proposed).
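For readers trying to picture the two-protocol idea floated in the message above, a rough sketch follows. Everything here is an assumption for illustration: __fspathb__ is only a name proposed in this thread, the function names fspath_two_protocols and fspathb_two_protocols are made up, and passing plain str/bytes through unchanged is a guess at the intended behaviour. Only os.fsencode() and os.fsdecode() are existing stdlib calls.

```python
import os


def fspath_two_protocols(obj):
    """Hypothetical str flavour: try __fspath__, else decode __fspathb__."""
    if isinstance(obj, str):
        return obj
    cls = type(obj)
    if hasattr(cls, "__fspath__"):
        return cls.__fspath__(obj)
    if hasattr(cls, "__fspathb__"):
        return os.fsdecode(cls.__fspathb__(obj))
    raise TypeError(f"{cls.__name__} is not path-like")


def fspathb_two_protocols(obj):
    """Hypothetical bytes flavour: try __fspathb__, else encode __fspath__."""
    if isinstance(obj, bytes):
        return obj
    cls = type(obj)
    if hasattr(cls, "__fspathb__"):
        return cls.__fspathb__(obj)
    if hasattr(cls, "__fspath__"):
        return os.fsencode(cls.__fspath__(obj))
    raise TypeError(f"{cls.__name__} is not path-like")


class TextPath:
    """Toy object implementing only the str protocol."""

    def __fspath__(self):
        return "tmp/data.txt"


class BytesPath:
    """Toy object implementing only the (hypothetical) bytes protocol."""

    def __fspathb__(self):
        return b"tmp/data.bin"


print(fspath_two_protocols(TextPath()))    # tmp/data.txt
print(fspathb_two_protocols(TextPath()))   # b'tmp/data.txt'
print(fspath_two_protocols(BytesPath()))   # tmp/data.bin
```

Even in this reduced form there are two dunder methods, two functions and four lookup branches, which gives some feel for the complexity the message argues is not worth having.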

On Wed, 13 Apr 2016 at 15:20 Victor Stinner <victor.stinner@gmail.com> wrote:
Oh, since others voted, I will also vote and explain my vote.
I like choice 1, str only, because it's very well defined. In Python 3, Unicode is simply the native type for text. It's accepted by almost all functions. In other emails, I also explained that Unicode is fine for storing undecodable filenames on UNIX; it has worked as expected for many years (since Python 3.3).
--
If you cannot survive without bytes, I suggest adding two functions: one for str only, another which can return str or bytes.
Maybe you in fact want two protocols: __fspath__ (str only) and __fspathb__ (bytes only)? os.fspathb() would first try __fspathb__, or fall back to os.fsencode(__fspath__). os.fspath() would first try __fspath__, or fall back to os.fsdecode(__fspathb__). IMHO it's not worth having such complexity while Unicode handles all use cases.
Implementing two magic methods for this seems like overkill. Best I would be willing to do with automatic encode/decode is use os.fsencode()/os.fsdecode() on the argument or what __fspath__() returned.
Or do you know functions implemented in Python accepting str *and* bytes?
On purpose, nothing off the top of my head.
--
The C implementation of the os module has an important path_converter() function:
 * path_converter accepts (Unicode) strings and their
 * subclasses, and bytes and their subclasses.  What
 * it does with the argument depends on the platform:
 *
 *   * On Windows, if we get a (Unicode) string we
 *     extract the wchar_t * and return it; if we get
 *     bytes we extract the char * and return that.
 *
 *   * On all other platforms, strings are encoded
 *     to bytes using PyUnicode_FSConverter, then we
 *     extract the char * from the bytes object and
 *     return that.
This function will implement something like os.fspath().
With os.fspath() only accepting str, we will directly return the Unicode string on Windows. On UNIX, Unicode will be encoded, as is already done for Unicode strings.
This specific function would benefit from flavor 4 (os.fspath() can return str and bytes), but it's more the exception than the rule. It would be more a micro-optimization than a good reason to drive the API design.
Yep, it's interesting to know but Chris and I won't let it drive the decision (I assume). -Brett
Victor
On Wednesday, 13 April 2016, Brett Cannon <brett@python.org> wrote:
https://gist.github.com/brettcannon/b3719f54715787d54a206bc011869aa1
has the four potential approaches implemented (although it doesn't follow the "separate functions" approach some are proposing and instead goes with the allow_bytes approach I originally proposed).
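Victor's two-protocol idea can be sketched in a few lines of Python. This is only an illustration of the fallback order he describes; the `__fspathb__` method, the module-level `fspathb()` function, and the `TextPath` and `RawPath` classes are hypothetical names that exist only for this sketch.

```python
import os


def fspath(path):
    """Return a str path: try __fspath__, then decode __fspathb__."""
    if isinstance(path, str):
        return path
    cls = type(path)
    if hasattr(cls, "__fspath__"):
        return cls.__fspath__(path)
    if hasattr(cls, "__fspathb__"):
        # Fallback described in the message: decode the bytes form.
        return os.fsdecode(cls.__fspathb__(path))
    raise TypeError(f"expected str or path object, not {cls.__name__}")


def fspathb(path):
    """Return a bytes path: try __fspathb__, then encode __fspath__."""
    if isinstance(path, bytes):
        return path
    cls = type(path)
    if hasattr(cls, "__fspathb__"):
        return cls.__fspathb__(path)
    if hasattr(cls, "__fspath__"):
        # Fallback described in the message: encode the str form.
        return os.fsencode(cls.__fspath__(path))
    raise TypeError(f"expected bytes or path object, not {cls.__name__}")


class TextPath:
    """Hypothetical path object that only knows its str form."""

    def __init__(self, text):
        self._text = text

    def __fspath__(self):
        return self._text


class RawPath:
    """Hypothetical path object that only knows its bytes form."""

    def __init__(self, raw):
        self._raw = raw

    def __fspathb__(self):
        return self._raw
```

Each caller gets exactly one return type, at the cost of two magic methods and two functions, which is the complexity Victor and Brett both consider excessive.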

On Apr 13 2016, Brett Cannon <brett@python.org> wrote:
On Tue, 12 Apr 2016 at 22:38 Michael Mysinger via Python-Dev < python-dev@python.org> wrote:
Ethan Furman <ethan <at> stoneleaf.us> writes:
Do we allow bytes to be returned from os.fspath()? If yes, then do we allow bytes from __fspath__()?
De-lurking. Especially since the ultimate goal is better interoperability, I feel like an implementation that people can play with would help guide the few remaining decisions. To help test the various options you could temporarily add a _allow_bytes=GLOBAL_CONFIG_OPTION default argument to both pathlib.__fspath__() and os.fspath(), with distinct configurable defaults for each.
In the spirit of Python 3 I feel like bytes might not be needed in practice, but something like this with defaults of False will allow people to easily test all the various options.
https://gist.github.com/brettcannon/b3719f54715787d54a206bc011869aa1 has the four potential approaches implemented (although it doesn't follow the "separate functions" approach some are proposing and instead goes with the allow_bytes approach I originally proposed).
When passing an object that is of type str and has a __fspath__ attribute, all approaches return the value of __fspath__(). However, when passing something of type bytes, the second approach returns the object, while the third returns the value of __fspath__(). Is this intentional? I think a __fspath__ attribute should always be preferred. Best, -Nikolaus -- GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F »Time flies like an arrow, fruit flies like a Banana.«

On 04/13/2016 03:45 PM, Nikolaus Rath wrote:
When passing an object that is of type str and has a __fspath__ attribute, all approaches return the value of __fspath__().
However, when passing something of type bytes, the second approach returns the object, while the third returns the value of __fspath__().
Is this intentional? I think a __fspath__ attribute should always be preferred.
Yes, it is intentional. The second approach assumes __fspath__ can only contain str, so there is no point in checking it for bytes. -- ~Ethan~

On Apr 13 2016, Ethan Furman <ethan@stoneleaf.us> wrote:
On 04/13/2016 03:45 PM, Nikolaus Rath wrote:
When passing an object that is of type str and has a __fspath__ attribute, all approaches return the value of __fspath__().
However, when passing something of type bytes, the second approach returns the object, while the third returns the value of __fspath__().
Is this intentional? I think a __fspath__ attribute should always be preferred.
Yes, it is intentional. The second approach assumes __fspath__ can only contain str, so there is no point in checking it for bytes.
Either I haven't understood your answer, or you haven't understood my question. I'm concerned about this case:
    class Special(bytes):
        def __fspath__(self):
            return 'str-val'
    obj = Special('bytes-val', 'utf8')
    path_obj = fspath(obj, allow_bytes=True)
With #2, path_obj == 'bytes-val'. With #3, path_obj == 'str-val'. I would expect that fspath(obj, allow_bytes=True) == 'str-val' (after all, it's allow_bytes, not require_bytes).
Best, -Nikolaus -- GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F »Time flies like an arrow, fruit flies like a Banana.«

On 04/13/2016 07:57 PM, Nikolaus Rath wrote:
On Apr 13 2016, Ethan Furman wrote:
On 04/13/2016 03:45 PM, Nikolaus Rath wrote:
When passing an object that is of type str and has a __fspath__ attribute, all approaches return the value of __fspath__().
However, when passing something of type bytes, the second approach returns the object, while the third returns the value of __fspath__().
Is this intentional? I think a __fspath__ attribute should always be preferred.
Yes, it is intentional. The second approach assumes __fspath__ can only contain str, so there is no point in checking it for bytes.
Either I haven't understood your answer, or you haven't understood my question. I'm concerned about this case:
    class Special(bytes):
        def __fspath__(self):
            return 'str-val'
    obj = Special('bytes-val', 'utf8')
    path_obj = fspath(obj, allow_bytes=True)
With #2, path_obj == 'bytes-val'. With #3, path_obj == 'str-val'.
I misunderstood your question. That is... an interesting case. ;) -- ~Ethan~
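The difference Nikolaus points out can be reproduced with a small sketch. The two functions below are not copied from Brett's gist; they are reconstructed from the descriptions in this thread (approach #2 passes bytes through before consulting the protocol, approach #3 consults the protocol first), and the names `fspath_v2` and `fspath_v3` are invented for the illustration.

```python
def fspath_v2(path, *, allow_bytes=False):
    # Assumed shape of approach #2: __fspath__() is str-only, so bytes
    # (including bytes subclasses) pass through before the protocol is used.
    if allow_bytes and isinstance(path, bytes):
        return path
    try:
        return path.__fspath__()
    except AttributeError:
        if isinstance(path, str):
            return path
    raise TypeError(f"unsupported path type: {type(path).__name__}")


def fspath_v3(path, *, allow_bytes=False):
    # Assumed shape of approach #3: the protocol is consulted first and
    # may itself return bytes when allow_bytes is true.
    try:
        path = path.__fspath__()
    except AttributeError:
        pass
    accepted = (str, bytes) if allow_bytes else str
    if isinstance(path, accepted):
        return path
    raise TypeError(f"unsupported path type: {type(path).__name__}")


class Special(bytes):
    def __fspath__(self):
        return 'str-val'


obj = Special('bytes-val', 'utf8')
print(fspath_v2(obj, allow_bytes=True))  # the bytes object itself
print(fspath_v3(obj, allow_bytes=True))  # the value returned by __fspath__()
```

Under these assumptions the bytes subclass is returned unchanged by the first function, while the second returns the str produced by `__fspath__()`.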

On 14 April 2016 at 13:14, Ethan Furman <ethan@stoneleaf.us> wrote:
On 04/13/2016 07:57 PM, Nikolaus Rath wrote:
Either I haven't understood your answer, or you haven't understood my question. I'm concerned about this case:
    class Special(bytes):
        def __fspath__(self):
            return 'str-val'
    obj = Special('bytes-val', 'utf8')
    path_obj = fspath(obj, allow_bytes=True)
With #2, path_obj == 'bytes-val'. With #3, path_obj == 'str-val'.
I misunderstood your question. That is... an interesting case. ;)
In this kind of case, inheritance tends to trump protocol. For example, int subclasses can't override operator.index:
    >>> from operator import index
    >>> class NotAnInt():
    ...     def __index__(self):
    ...         return 42
    ...
    >>> index(NotAnInt())
    42
    >>> class MyInt(int):
    ...     def __index__(self):
    ...         return 42
    ...
    >>> index(MyInt(53))
    53
The reasons for that behaviour are more pragmatic than philosophical: builtins and their subclasses are extensively special-cased for speed reasons, and those shortcuts are encountered before the interpreter even considers using the general protocol. In cases where the magic method return types are polymorphic (so subclasses may want to override them) we'll use more restrictive exact type checks for the shortcuts, but that argument doesn't apply for typechecked protocols where the result is required to be an instance of a particular builtin type (but subclasses are considered acceptable). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Wed, Apr 13, 2016, at 23:27, Nick Coghlan wrote:
In this kind of case, inheritance tends to trump protocol. For example, int subclasses can't override operator.index: ... The reasons for that behaviour are more pragmatic than philosophical: builtins and their subclasses are extensively special-cased for speed reasons, and those shortcuts are encountered before the interpreter even considers using the general protocol.
In cases where the magic method return types are polymorphic (so subclasses may want to override them) we'll use more restrictive exact type checks for the shortcuts, but that argument doesn't apply for typechecked protocols where the result is required to be an instance of a particular builtin type (but subclasses are considered acceptable).
Then why aren't we doing it for str? Because "try: path = path.__fspath__()" is more idiomatic than the alternative? If some sort of reasoned decision has been made to require the protocol to trump the special case for str subclasses, it's unreasonable not to apply the same decision to bytes subclasses. The decision should be "always use the protocol first" or "always use the type match first". In other words, why not this:
    def fspath(path, *, allow_bytes=False):
        if isinstance(path, (bytes, str) if allow_bytes else str):
            return path
        try:
            m = path.__fspath__
        except AttributeError:
            raise TypeError
        path = m()
        if isinstance(path, (bytes, str) if allow_bytes else str):
            return path
        raise TypeError

On 14 April 2016 at 14:05, Random832 <random832@fastmail.com> wrote:
On Wed, Apr 13, 2016, at 23:27, Nick Coghlan wrote:
In this kind of case, inheritance tends to trump protocol. For example, int subclasses can't override operator.index: ... The reasons for that behaviour are more pragmatic than philosophical: builtins and their subclasses are extensively special-cased for speed reasons, and those shortcuts are encountered before the interpreter even considers using the general protocol.
In cases where the magic method return types are polymorphic (so subclasses may want to override them) we'll use more restrictive exact type checks for the shortcuts, but that argument doesn't apply for typechecked protocols where the result is required to be an instance of a particular builtin type (but subclasses are considered acceptable).
Then why aren't we doing it for str? Because "try: path = path.__fspath__()" is more idiomatic than the alternative?
The sketches Brett posted will bear little resemblance to the actual implementation - that will be in C and use similar idioms to those we use for other abstract protocols (such as shortcuts for instances of builtin types, and doing the method lookup via the passed in object's type, rather than on the instance). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
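The two idioms Nick mentions (a shortcut for instances of builtin types, and method lookup on the type rather than the instance) can be approximated in pure Python. This is a sketch of those idioms only, not the eventual C implementation; the class names are invented for the example.

```python
def fspath(path):
    # Shortcut: str instances (and subclasses) are returned unchanged,
    # before the protocol is even considered.
    if isinstance(path, str):
        return path
    # Look the method up on the type, not the instance.
    try:
        method = type(path).__fspath__
    except AttributeError:
        raise TypeError(
            f"expected str or path object, not {type(path).__name__}"
        ) from None
    result = method(path)
    if not isinstance(result, str):
        raise TypeError("__fspath__() must return str")
    return result


class OnType:
    def __fspath__(self):
        return "from-type"


class StrSubclass(str):
    def __fspath__(self):
        return "never-used"


class Plain:
    pass


sneaky = Plain()
# An attribute set on the instance is invisible to a type-based lookup.
sneaky.__fspath__ = lambda: "from-instance"
```

With this shape, `fspath(StrSubclass("plain"))` returns the string itself, which mirrors the `operator.index` behaviour for int subclasses, and `fspath(sneaky)` raises `TypeError`.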

On Apr 13, 2016, at 8:31 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
    class Special(bytes):
        def __fspath__(self):
            return 'str-val'
    obj = Special('bytes-val', 'utf8')
    path_obj = fspath(obj, allow_bytes=True)
With #2, path_obj == 'bytes-val'. With #3, path_obj == 'str-val'.
In this kind of case, inheritance tends to trump protocol.
Sure, but...
example, int subclasses can't override operator.index: ... The reasons for that behaviour are more pragmatic than philosophical: builtins and their subclasses are extensively special-cased for speed reasons,
OK, but in this case, purity can beat practicality. If the author writes an __fspath__ method, presumably it's because it should be used. And I can certainly imagine one might want to store a path representation as bytes, but NOT want the raw bytes passed off to file handling libs. (of course you could use composition rather than subclassing if you had to) -CHB

On 17 April 2016 at 04:47, Chris Barker - NOAA Federal <chris.barker@noaa.gov> wrote:
On Apr 13, 2016, at 8:31 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
    class Special(bytes):
        def __fspath__(self):
            return 'str-val'
    obj = Special('bytes-val', 'utf8')
    path_obj = fspath(obj, allow_bytes=True)
With #2, path_obj == 'bytes-val'. With #3, path_obj == 'str-val'.
In this kind of case, inheritance tends to trump protocol.
Sure, but...
example, int subclasses can't override operator.index: ... The reasons for that behaviour are more pragmatic than philosophical: builtins and their subclasses are extensively special-cased for speed reasons,
OK, but in this case, purity can beat practicality. If the author writes an __fspath__ method, presumably it's because it should be used.
And I can certainly imagine one might want to store a path representation as bytes, but NOT want the raw bytes passed off to file handling libs.
(of course you could use composition rather than subclassing if you had to)
Exactly - inheritance is a really strong relationship that directly affects the in-memory layout of instances (at least in CPython), and also the kinds of assumption other code will make about that type (for example, subclasses are special cased to allow them to override the behaviour of numeric binary operators when they appear as the right operand with an instance of the parent type as the left operand, while with unrelated types, the left operand always gets the first chance to handle the operation). When folks don't want to trigger those "this is an <X>" behaviours, the appropriate design pattern is composition, not inheritance (and many of the ABCs were introduced to make it easier to implement particular interfaces without inheriting from the corresponding builtin types). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
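The binary-operator special case Nick refers to is easy to observe. The classes below are invented for the illustration; the behaviour shown is the documented rule that a right operand whose type is a subclass of the left operand's type, and which overrides the reflected method, is tried first.

```python
class Base:
    def __add__(self, other):
        return "Base.__add__"

    def __radd__(self, other):
        return "Base.__radd__"


class Sub(Base):
    # Overriding the reflected method is what triggers the priority rule.
    def __radd__(self, other):
        return "Sub.__radd__"


class Unrelated:
    def __radd__(self, other):
        return "Unrelated.__radd__"


# Right operand is a subclass of the left operand's type: it goes first.
subclass_result = Base() + Sub()
# Unrelated right operand: the left operand handles the operation.
unrelated_result = Base() + Unrelated()
```

Composition avoids this kind of implicit "is a Base" treatment, which is the design point being made above.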

On Wed, 13 Apr 2016 at 15:46 Nikolaus Rath <Nikolaus@rath.org> wrote:
On Apr 13 2016, Brett Cannon <brett@python.org> wrote:
On Tue, 12 Apr 2016 at 22:38 Michael Mysinger via Python-Dev <python-dev@python.org> wrote:
Ethan Furman <ethan <at> stoneleaf.us> writes:
Do we allow bytes to be returned from os.fspath()? If yes, then do we allow bytes from __fspath__()?
De-lurking. Especially since the ultimate goal is better interoperability, I feel like an implementation that people can play with would help guide the few remaining decisions. To help test the various options you could temporarily add a _allow_bytes=GLOBAL_CONFIG_OPTION default argument to both pathlib.__fspath__() and os.fspath(), with distinct configurable defaults for each.
In the spirit of Python 3 I feel like bytes might not be needed in practice, but something like this with defaults of False will allow people to easily test all the various options.
https://gist.github.com/brettcannon/b3719f54715787d54a206bc011869aa1 has the four potential approaches implemented (although it doesn't follow the "separate functions" approach some are proposing and instead goes with the allow_bytes approach I originally proposed).
When passing an object that is of type str and has a __fspath__ attribute, all approaches return the value of __fspath__().
However, when passing something of type bytes, the second approach returns the object, while the third returns the value of __fspath__().
Is this intentional? I think a __fspath__ attribute should always be preferred.
It's very much intentional. If we define __fspath__() to only return strings but still want to minimize boilerplate of allowing bytes to simply pass through without checking a path argument to see if it is bytes then approach #2 is warranted. But if __fspath__() can return bytes then approach #3 allows for it.

Brett Cannon <brett <at> python.org> writes:
https://gist.github.com/brettcannon/b3719f54715787d54a206bc011869aa1 has the four potential approaches implemented (although it doesn't follow the "separate functions" approach some are proposing and instead goes with the allow_bytes approach I originally proposed).
Thanks Brett, it is definitely a start! Maybe I am just more unimaginative than most, but since interoperability is the goal, I would ideally be able to play with a full implementation where all the stdlib functions Nick originally mentioned accepted these "rich path" objects. However, for concrete example purposes, maybe it is sufficient to start with your fspath function, a toy RichPath class implementing __fspath__, and something like os.path.join, which is a meaty enough example to test some of the functionality. I posted a gist of a string-only example at https://gist.github.com/mmysinger/0b5ae2cfb866f7013c387a2683c7fc39
After playing with and considering the 4 possibilities, anything where __fspath__ can return bytes seems like insanity that flies in the face of everything Python 3 is trying to accomplish. In particular, one RichPath class might return bytes and another str, or even worse the same class might sometimes return bytes and sometimes str. When will os.path.join blow up due to mixing bytes and str and when will it work in those situations? So for me that eliminates #3 and #4.
Also, version #2, which accepts bytes in os.fspath, felt like it could be a very minor convenience, but even the str-only version #1 just requires one isinstance check in the rare case you need to also deal with bytes (see the os.path.join example in the gist above). So I lean toward the str-only #1 version. In any case I would start with the strict str-only full implementation and loosen it either in 3.6 or 3.7 depending on what people think after actually using it.
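The "one isinstance check" Michael mentions can be sketched as follows. This is not the code from his gist; `RichPath`, the str-only `fspath()`, and the `join()` wrapper are minimal stand-ins written for this illustration.

```python
import os.path


class RichPath:
    """Toy rich path object whose protocol method returns str only."""

    def __init__(self, path):
        self._path = path

    def __fspath__(self):
        return self._path


def fspath(path):
    """Str-only flavour (#1): bytes are rejected here."""
    if isinstance(path, str):
        return path
    try:
        method = type(path).__fspath__
    except AttributeError:
        raise TypeError(
            f"expected str or path object, not {type(path).__name__}"
        ) from None
    result = method(path)
    if not isinstance(result, str):
        raise TypeError("__fspath__() must return str")
    return result


def join(first, *rest):
    # The single isinstance check: bytes skip the str-only protocol,
    # everything else goes through fspath().
    parts = [p if isinstance(p, bytes) else fspath(p) for p in (first, *rest)]
    return os.path.join(*parts)
```

Here `join(RichPath("home"), "user")` and `join(b"home", b"user")` both work, while mixing a `RichPath` with bytes still fails inside `os.path.join` with the usual `TypeError`.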

On 04/14/2016 12:03 AM, Michael Mysinger via Python-Dev wrote:
Brett Cannon writes:
After playing with and considering the 4 possibilities, anything where __fspath__ can return bytes seems like insanity that flies in the face of everything Python 3 is trying to accomplish. In particular, one RichPath class might return bytes and another str, or even worse the same class might sometimes return bytes and sometimes str. When will os.path.join blow up due to mixing bytes and str and when will it work in those situations?
What are you asking here? Exactly where in os.path.join the exception will occur when mixing bytes & str, or whether mixing bytes & str will ever work? The answer to the first is irrelevant (except for performance). The answer to the second is always/never. Meaning allowing os.fspath() and __fspath__ to return either bytes or str will never cause the combination of bytes and str to work. Said another way: if you are using os.path.join then all the pieces have to be str or all the pieces have to be bytes. -- ~Ethan~

Ethan Furman <ethan <at> stoneleaf.us> writes:
On 04/14/2016 12:03 AM, Michael Mysinger via Python-Dev wrote:
In particular, one RichPath class might return bytes and another str, or even worse the same class might sometimes return bytes and sometimes str. When will os.path.join blow up due to mixing bytes and str and when will it work in those situations?
What are you asking here? ... Meaning allowing os.fspath() and __fspath__ to return either bytes or str will never cause the combination of bytes and str to work. Said another way: if you are using os.path.join then all the pieces have to be str or all the pieces have to be bytes.
I am saying that if os.path.join now accepts RichPath objects, and those objects can return either str or bytes, then it's much harder to reason about when I have all bytes or all strings. In essence, you will force me to pre-wrap all RichPath objects in either os.fsencode(os.fspath(path)) or os.fsdecode(os.fspath(path)), just so I can reason about the type. And if I have to always do that wrapping then os.path.join doesn't need to accept RichPath objects and call fspath at all.

On Apr 14, 2016, at 11:59 AM, Michael Mysinger via Python-Dev <python-dev@python.org> wrote:
In essence, you will force me to pre-wrap all RichPath objects in either os.fsencode(os.fspath(path)) or os.fsdecode(os.fspath(path)), just so I can reason about the type.
This is only the case if you have a singular RichPath object that can represent both bytes and str (which is what DirEntry does, which I agree makes it harder… but that’s already the case with DirEntry.path). However that’s not the case if you have a bRichPath and uRichPath. ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

Donald Stufft <donald <at> stufft.io> writes:
On Apr 14, 2016, at 11:59 AM, Michael Mysinger via Python-Dev <python-dev <at> python.org> wrote:
In essence, you will force me to pre-wrap all RichPath objects in either os.fsencode(os.fspath(path)) or os.fsdecode(os.fspath(path)), just so I can reason about the type.
This is only the case if you have a singular RichPath object that can represent both bytes and str (which is what DirEntry does, which I agree makes it harder… but that’s already the case with DirEntry.path). However that’s not the case if you have a bRichPath and uRichPath.
And you might even be able to retain your sanity if you enforce any particular class to be either bRichPath or uRichPath. But if you do that, then that still leaves DirEntry out in the cold, likely converting to str in its __fspath__. Which leaves me in the camp that bRichPath falls under YAGNI, and RichPath should be str only.

On 04/14/2016 08:59 AM, Michael Mysinger via Python-Dev wrote:
I am saying that if os.path.join now accepts RichPath objects, and those objects can return either str or bytes, then it's much harder to reason about when I have all bytes or all strings. In essence, you will force me to pre-wrap all RichPath objects in either os.fsencode(os.fspath(path)) or os.fsdecode(os.fspath(path)), just so I can reason about the type. And if I have to always do that wrapping then os.path.join doesn't need to accept RichPath objects and call fspath at all.
What many folks seem to be missing is that *you* (generic you) have control of your data. If you are not working at the bytes layer, you shouldn't be getting bytes objects because: - you specified str when asking for data from the OS, or - you transformed the incoming bytes from whatever external source to str when you received them. -- ~Ethan~

On 14 April 2016 at 17:46, Ethan Furman <ethan@stoneleaf.us> wrote:
On 04/14/2016 08:59 AM, Michael Mysinger via Python-Dev wrote:
I am saying that if os.path.join now accepts RichPath objects, and those objects can return either str or bytes, then it's much harder to reason about when I have all bytes or all strings. In essence, you will force me to pre-wrap all RichPath objects in either os.fsencode(os.fspath(path)) or os.fsdecode(os.fspath(path)), just so I can reason about the type. And if I have to always do that wrapping then os.path.join doesn't need to accept RichPath objects and call fspath at all.
What many folks seem to be missing is that *you* (generic you) have control of your data.
If you are not working at the bytes layer, you shouldn't be getting bytes objects because:
- you specified str when asking for data from the OS, or - you transformed the incoming bytes from whatever external source to str when you received them.
My experience is that (particularly with code that was originally written for Python 2) "you have control of your data" is often an illusion - bytes can appear in code from unexpected sources, and when they do I'd rather see an error if I'm using code where I expect a string. Certainly that's a bug in the code - all I'm saying is that it should fail early rather than late.
Having said this, I don't have an actual use case - but equally it seems to me that our problem is that *nobody* does (yet) because uptake of pathlib has been slow, thanks to limited stdlib support. My view remains that we should get the (relatively simple and uncontroversial) str support in place, and defer bytes support until we have experience with that. I'd appreciate it if anyone can clarify why "gracefully extending" the protocol to include bytes support at a later date isn't practical. Paul

On 04/14/2016 10:22 AM, Paul Moore wrote:
On 14 April 2016 at 17:46, Ethan Furman wrote:
If you are not working at the bytes layer, you shouldn't be getting bytes objects because:
- you specified str when asking for data from the OS, or - you transformed the incoming bytes from whatever external source to str when you received them.
My experience is that (particularly with code that was originally written for Python 2) "you have control of your data" is often an illusion - bytes can appear in code from unexpected sources, and when they do I'd rather see an error if I'm using code where I expect a string. Certainly that's a bug in the code - all I'm saying is that it should fail early rather than late.
If we have one function that uses a flag and you leave the flag alone (it defaults to rejecting bytes) -- voila! An error is raised when bytes show up.
I'd appreciate it if anyone can clarify why "gracefully extending" the protocol to include bytes support at a later date isn't practical.
It's going to be a bunch of work. I don't want to do the work twice. On the other hand, if while doing the work it becomes apparent that supporting bytes and str in the protocol is either infeasible, confusing, or a plain ol' bad idea I have no problem ripping out the bytes support and going to str only. -- ~Ethan~

On Thu, Apr 14, 2016 at 7:46 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
What many folks seem to be missing is that *you* (generic you) have control of your data.
If you are not working at the bytes layer, you shouldn't be getting bytes objects because:
- you specified str when asking for data from the OS, or - you transformed the incoming bytes from whatever external source to str when you received them.
There is an apparent contradiction of the above with some previous posts, including your own. Let me try to fix it:
Code that deals with paths can be divided in groups as follows:
(1) Code that has access to pathname/filename data and has some level of control over what data type comes in. This code may for instance choose to deal with either bytes or str
(2) Code that takes the path or file name that it happens to get and does something with it. This type of code can be divided into subgroups as follows:
(2a) Code that accepts only one type of paths (e.g. str, bytes or pathlib) and fails if it gets something else.
(2b) Code that wants to support different types of paths such as str, bytes or pathlib objects. This includes os.path.*, os.scandir, and various other standard library code. Presumably there is also third-party code that does the same. These functions may want to preserve the str-ness or bytes-ness of the paths in case they return paths, as the stdlib now does. But new code may even want to return pathlib objects when they get such objects as inputs. This is the duck-typing or polymorphic code we have been talking about. Code of this type (2b) may want to avoid implicit conversions because it makes the life of code of the other types more difficult.
(feel free to fill in more categories of code)
So the code of type (2b) is trying to make all categories happy by returning objects of the same type that it gets as input, while the other categories are probably in the situation where they don't necessarily need to make other categories of code happy.
And the question is this: Do we need to make code using both bytes *and* scandir happy? This is largely the same question as whether we have to support bytes in addition to str in the protocol.
(We may of course talk about third-party path libraries that have the same problem as scandir's DirEntry. Ethan's library is not exactly in the same category as DirEntry since its path objects *are* instances of bytes or str and therefore do not need this protocol to begin with, except perhaps for conversions from other high-level path types so that different path libraries work together nicely).
-Koos
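A category (2b) function in the sense described above, one that returns the same kind of path it was given, might look like the following. The function name and the ".bak" suffix are invented for this illustration.

```python
def add_backup_suffix(path):
    """Append a backup suffix, preserving the str-ness or bytes-ness."""
    if isinstance(path, str):
        return path + ".bak"
    if isinstance(path, bytes):
        return path + b".bak"
    # No implicit conversion: anything else is rejected outright.
    raise TypeError(f"expected str or bytes, not {type(path).__name__}")
```

A caller that works purely in str, or purely in bytes, never has to think about conversions when using a function written this way.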

On Thu, Apr 14, 2016, at 13:56, Koos Zevenhoven wrote:
(1) Code that has access to pathname/filename data and has some level of control over what data type comes in. This code may for instance choose to deal with either bytes or str
(2) Code that takes the path or file name that it happens to get and does something with it. This type of code can be divided into subgroups as follows:
(2a) Code that accepts only one type of paths (e.g. str, bytes or pathlib) and fails if it gets something else.
Ideally, these should go away.
(2b) Code that wants to support different types of paths such as str, bytes or pathlib objects. This includes os.path.*, os.scandir, and various other standard library code. Presumably there is also third-party code that does the same. These functions may want to preserve the str-ness or bytes-ness of the paths in case they return paths, as the stdlib now does. But new code may even want to return pathlib objects when they get such objects as inputs.
Hold on. None of the discussion I've seen has included any way to specify how to construct a new object representing a different path other than the ones passed in. Surely you're not suggesting type(a)(b). Also, how does DirEntry fit in with any of this?
This is the duck-typing or polymorphic code we have been talking about. Code of this type (2b) may want to avoid implicit conversions because it makes the life of code of the other types more difficult.
As long as the type it returns is still a path/bytes/str (and therefore can be accepted when the caller passes it somewhere else) what's the problem?

On Thu, Apr 14, 2016 at 9:35 PM, Random832 <random832@fastmail.com> wrote:
On Thu, Apr 14, 2016, at 13:56, Koos Zevenhoven wrote:
(1) Code that has access to pathname/filename data and has some level of control over what data type comes in. This code may for instance choose to deal with either bytes or str
(2) Code that takes the path or file name that it happens to get and does something with it. This type of code can be divided into subgroups as follows:
(2a) Code that accepts only one type of paths (e.g. str, bytes or pathlib) and fails if it gets something else.
Ideally, these should go away.
I don't think so. (1) might even be the most common type of all code. This is code that gets a path from user input, from a config file, from a database etc. and then does things with it, typically including passing it to type (2) code and potentially getting a path back from there too.
(2b) Code that wants to support different types of paths such as str, bytes or pathlib objects. This includes os.path.*, os.scandir, and various other standard library code. Presumably there is also third-party code that does the same. These functions may want to preserve the str-ness or bytes-ness of the paths in case they return paths, as the stdlib now does. But new code may even want to return pathlib objects when they get such objects as inputs.
Hold on. None of the discussion I've seen has included any way to specify how to construct a new object representing a different path other than the ones passed in. Surely you're not suggesting type(a)(b).
That's right. This protocol is not solving the issue of returning 'rich' path objects. It's solving the issue of passing those objects to lower-level functions or to interact with other 'rich' path types. What I meant by this is that there may be code that *does* want to do type(a)(b), which is out of our control. Maybe I should not have mentioned that.
Also, how does DirEntry fit in with any of this?
os.scandir + DirEntry are one of the many things in the stdlib that give you pathnames of the same type as those that were put in.
This is the duck-typing or polymorphic code we have been talking about. Code of this type (2b) may want to avoid implicit conversions because it makes the life of code of the other types more difficult.
As long as the type it returns is still a path/bytes/str (and therefore can be accepted when the caller passes it somewhere else) what's the problem?
No, because not all paths are passed to the function that does the implicit conversion, and then when for instance os.path.joining two paths of different types, it raises an error. In other words: Most non-library code (even library code?) deals with one specific type and does not want implicit conversions to other types. Some code (2b) deals with several types and, at least in the stdlib, such code returns paths of the same type as they are given, which makes said "most non-library code" happy, because it does not force the programmer to think about type conversions. (Then there is also code that explicitly deals with type conversions, such as os.fsencode and os.fsdecode.) -Koos

2016-04-13 19:10 GMT+02:00 Brett Cannon <brett@python.org>:
https://gist.github.com/brettcannon/b3719f54715787d54a206bc011869aa1 has the four potential approaches implemented (although it doesn't follow the "separate functions" approach some are proposing and instead goes with the allow_bytes approach I originally proposed).
IMHO the best argument against the flavor 4 (fspath: str or bytes allowed) is the os.path.join() function. I consider that the final goal of the whole discussion is to support something like:
    path = os.path.join(pathlib_path, "str_path", direntry)
Even if direntry uses a bytes filename. I expect genericpath.join() to be patched to use os.fspath(). If os.fspath() returns bytes, path.join() will fail with an annoying TypeError. I expect that DirEntry.__fspath__ uses os.fsdecode() to return str, just to make my life easier.
I recall that I used to say that Python 2 doesn't support Unicode filenames because os.path.join() raises a UnicodeDecodeError when you try to join a Unicode filename with a byte filename which contains non-ASCII bytes. The problem occurs indirectly in code using hardcoded paths, Unicode or bytes paths. Saying that "Python 2 doesn't support Unicode filenames" is wrong, but since Unicode is a hard problem, I tried to simplify my explanation :-)
You can apply the same rationale for the flavors 2 and 3 (os.fspath(path, allow_bytes=True)). Indirectly, you will get similar TypeError on os.path.join(). Victor
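What Victor expects from DirEntry can be shown with a toy class. `ToyEntry` is a hypothetical stand-in that follows his expectation (decode in `__fspath__`); it is not a description of how os.DirEntry behaves.

```python
import os


class ToyEntry:
    """Hypothetical directory entry that may hold a bytes or str name."""

    def __init__(self, name):
        self._name = name

    def __fspath__(self):
        # Always hand out str, decoding bytes names with the filesystem
        # encoding, as expected in the message above.
        return os.fsdecode(self._name)


entry = ToyEntry(b"report.txt")
# The protocol result is str, so it mixes freely with other str components.
joined = os.path.join("archive", entry.__fspath__())
```

Because the entry always yields str, the join cannot fail with the bytes/str `TypeError` even though the underlying name is bytes.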

On 14 April 2016 at 22:16, Victor Stinner <victor.stinner@gmail.com> wrote:
2016-04-13 19:10 GMT+02:00 Brett Cannon <brett@python.org>:
https://gist.github.com/brettcannon/b3719f54715787d54a206bc011869aa1 has the four potential approaches implemented (although it doesn't follow the "separate functions" approach some are proposing and instead goes with the allow_bytes approach I originally proposed).
IMHO the best argument against the flavor 4 (fspath: str or bytes allowed) is the os.path.join() function.
I consider that the final goal of the whole discussion is to support something like:
path = os.path.join(pathlib_path, "str_path", direntry)
That's not a *new* problem though, it already exists if you pass in a mix of bytes and str:
>>> import os.path
>>> os.path.join("str", b"bytes")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.4/posixpath.py", line 89, in join
    "components") from None
TypeError: Can't mix strings and bytes in path components
There's also already a solution (regardless of whether you want bytes or str as the result), which is to explicitly coerce all the arguments to the same type:
>>> os.path.join(*map(os.fsdecode, ("str", b"bytes")))
'str/bytes'
>>> os.path.join(*map(os.fsencode, ("str", b"bytes")))
b'str/bytes'
Assuming os.fsdecode and os.fsencode are updated to call os.fspath on their argument before continuing with the current logic, the latter two forms would both start automatically handling both DirEntry and pathlib objects, while the first form would continue to throw TypeError if handed an unexpected bytes value (whether directly or via an __fspath__ call). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
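A minimal sketch of the explicit-coercion forms with a path object in the mix. It assumes an interpreter where os.fsdecode and os.fsencode call os.fspath on their argument (Python 3.6 or later), and it uses posixpath.join rather than os.path.join only so that the separator is the same on every platform.

```python
import os
import pathlib
import posixpath

# A pathlib object, a str and a bytes component, all in one call.
parts = (pathlib.PurePosixPath("dir"), "str", b"bytes")

# Coerce everything to str first, then join.
as_text = posixpath.join(*map(os.fsdecode, parts))

# Coerce everything to bytes first, then join.
as_bytes = posixpath.join(*map(os.fsencode, parts))

print(as_text)
print(as_bytes)
```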

On Thu, Apr 14, 2016, at 09:40, Nick Coghlan wrote:
That's not a *new* problem though, it already exists if you pass in a mix of bytes and str:
There's also already a solution (regardless of whether you want bytes or str as the result), which is to explicitly coerce all the arguments to the same type:
It'd be nice if that went away. Having to do that makes about as much sense to me as if you had to explicitly coerce an int to a float to add them together. Sure, explicit is better than implicit, but there are limits. You're explicitly calling os.path.join; isn't that explicit enough?

On Thu, Apr 14, 2016 at 11:45 PM, Random832 <random832@fastmail.com> wrote:
On Thu, Apr 14, 2016, at 09:40, Nick Coghlan wrote:
That's not a *new* problem though, it already exists if you pass in a mix of bytes and str:
There's also already a solution (regardless of whether you want bytes or str as the result), which is to explicitly coerce all the arguments to the same type:
It'd be nice if that went away. Having to do that makes about as much sense to me as if you had to explicitly coerce an int to a float to add them together. Sure, explicit is better than implicit, but there are limits. You're explicitly calling os.path.join; isn't that explicit enough?
Adding integers and floats is considered "safe" because most people's use of floats completely compasses their use of ints. (You'll get OverflowError if it can't be represented.) But float and Decimal are considered "unsafe":
>>> 1.5 + decimal.Decimal("1.5")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'float' and 'decimal.Decimal'
This is more what's happening here. Floats and Decimals can represent similar sorts of things, but with enough incompatibilities that you can't simply merge them. ChrisA
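The three numeric cases mentioned above can be checked side by side. add_or_error is a helper name made up for this sketch; it returns either the sum or the name of the exception raised.

```python
import decimal


def add_or_error(a, b):
    """Return a + b, or the name of the exception the addition raised."""
    try:
        return a + b
    except (TypeError, OverflowError) as exc:
        return type(exc).__name__


# int + float: the int is converted to float implicitly.
mixed = add_or_error(1, 1.5)

# An int too large for a float cannot be converted.
too_big = add_or_error(10 ** 400, 1.5)

# float + Decimal: no implicit conversion is attempted.
unsafe = add_or_error(1.5, decimal.Decimal("1.5"))

print(mixed, too_big, unsafe)
```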

On Thu, Apr 14, 2016, at 09:50, Chris Angelico wrote:
Adding integers and floats is considered "safe" because most people's use of floats completely compasses their use of ints. (You'll get OverflowError if it can't be represented.) But float and Decimal are considered "unsafe":
>>> 1.5 + decimal.Decimal("1.5")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'float' and 'decimal.Decimal'
This is more what's happening here. Floats and Decimals can represent similar sorts of things, but with enough incompatibilities that you can't simply merge them.
And what such incompatibilities exist between bytes and str for the purpose of representing file paths? At the end of the day, there's exactly one answer to "what file on disk this represents (or would represent if it existed)".

On 04/14/2016 07:01 AM, Random832 wrote:
On Thu, Apr 14, 2016, at 09:50, Chris Angelico wrote:
Adding integers and floats is considered "safe" because most people's use of floats completely compasses their use of ints. (You'll get OverflowError if it can't be represented.) But float and Decimal are considered "unsafe":
--> 1.5 + decimal.Decimal("1.5")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'float' and 'decimal.Decimal'
This is more what's happening here. Floats and Decimals can represent similar sorts of things, but with enough incompatibilities that you can't simply merge them.
And what such incompatibilities exist between bytes and str for the purpose of representing file paths? At the end of the day, there's exactly one answer to "what file on disk this represents (or would represent if it existed)".
Interoperability with other systems and/or libraries. If we use surrogateescape to transform str to bytes, and the other side does not, we no longer have a workable path. -- ~Ethan~
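A small sketch of the mismatch, using explicit codec arguments instead of os.fsencode/os.fsdecode so that it behaves the same on every platform. The choice of UTF-8 and the sample bytes are assumptions made for the illustration.

```python
# A file name that is not valid UTF-8 (a lone Latin-1 "e acute" byte).
raw = b"caf\xe9"

# Decoding with surrogateescape smuggles the bad byte into the str
# as a lone surrogate code point.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same error handler restores the original bytes.
round_trip = text.encode("utf-8", "surrogateescape")

# A peer that encodes strictly cannot handle that str at all.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(repr(text), round_trip, strict_ok)
```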

2016-04-14 17:29 GMT+02:00 Ethan Furman <ethan@stoneleaf.us>:
Interoperability with other systems and/or libraries. If we use surrogateescape to transform str to bytes, and the other side does not, we no longer have a workable path.
I guess that you mean a Python library? When you exchange data with external programs or call C libraries, Python is responsible for encoding Unicode to bytes with os.fsencode(). The external part is not aware that Python uses surrogateescape; it gets "regular" bytes. I suggest treating such a Python library like external programs and libraries: convert Unicode to bytes with os.fsencode(), but process paths as Unicode "inside" your application. It's the basic rule for handling Unicode correctly in an application: decode inputs as soon as possible, and encode back as late as possible. Encode/decode at borders. Victor

Random832 writes:
And what such incompatibilities exist between bytes and str for the purpose of representing file paths?
A plethora of encodings.
At the end of the day, there's exactly one answer to "what file on disk this represents (or would represent if it existed)".
Nope. Suppose those bytes were read from a file or a socket? It's dangerous to assume that encoding matches the file system's.
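To make the "plethora of encodings" point concrete, here is a sketch in which the same textual name yields different bytes under two encodings. The name and the two codecs are arbitrary choices for the illustration.

```python
name = "caf\u00e9"  # "cafe" with an acute accent on the e

# The same name as two different byte strings.
utf8_name = name.encode("utf-8")
latin1_name = name.encode("latin-1")

# Reading the UTF-8 bytes with the wrong codec silently gives another name.
misread = utf8_name.decode("latin-1")

print(utf8_name, latin1_name, misread == name)
```

Bytes received from a file or a socket carry no record of which of these encodings produced them.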

On Thu, Apr 14, 2016, at 12:05, Stephen J. Turnbull wrote:
Random832 writes:
And what such incompatibilities exist between bytes and str for the purpose of representing file paths?
A plethora of encodings.
Only one encoding, fsencode/fsdecode. All other encodings are not for filenames.
At the end of the day, there's exactly one answer to "what file on disk this represents (or would represent if it existed)".
Nope. Suppose those bytes were read from a file or a socket? It's dangerous to assume that encoding matches the file system's.
Why can I pass them to os.open, then, or to os.path.join so long as everything else is also bytes? On UNIX, the filesystem is in bytes, so saying that bytes can't match the filesystem is absurd. Converting it to str with fsdecode will *always, absolutely, 100% of the time* give a str that will address the same file that the bytes does (even if it's "dangerous" to assume that was the name the user wanted, that's beyond the scope of what the module is capable of dealing with).

On 15 April 2016 at 00:01, Random832 <random832@fastmail.com> wrote:
On Thu, Apr 14, 2016, at 09:50, Chris Angelico wrote:
Adding integers and floats is considered "safe" because most people's use of floats completely compasses their use of ints. (You'll get OverflowError if it can't be represented.) But float and Decimal are considered "unsafe":
>>> 1.5 + decimal.Decimal("1.5")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'float' and 'decimal.Decimal'
This is more what's happening here. Floats and Decimals can represent similar sorts of things, but with enough incompatibilities that you can't simply merge them.
And what such incompatibilities exist between bytes and str for the purpose of representing file paths? At the end of the day, there's exactly one answer to "what file on disk this represents (or would represent if it existed)".
Bytes paths on Windows are encoded as mbcs for use with the ASCII-only Windows APIs, and hence don't support the full range of characters that str does. The colloquial shorthand for that is "bytes paths don't work properly on Windows" (the more strictly accurate description is "bytes paths only work correctly on Windows if every code point in the path can be encoded using the 'mbcs' codec").
Even on *nix, os.fsencode may fail outright if the system is configured to use a non-universal encoding, while os.fsdecode may pollute the resulting string with surrogate escaped characters.
Regardless of platform, if somebody hands you *mixed* bytes and str data, the appropriate default reaction is to complain about it rather than assume they meant one or the other. That complaint may take one of two forms:
- for a high level, platform independent API, bytes should just be rejected outright
- for a low level API with input type dependent behaviour, the input should be rejected as ambiguous - the API doesn't know whether the str behaviour or the bytes behaviour is the intended one
pathlib falls into the first category - it just rejects bytes as input
os.path.join falls into the second category - all str is fine, and all bytes is fine, but mixing them fails
However, once somebody reaches for the coercion APIs (fsdecode and fsencode), they're now *explicitly* telling the interpreter what they want, since there's no ambiguity about the possible return types from those functions.
In relation to Victor's comment about this being complex code to show to a novice:
os.path.join(*map(os.fsdecode, ("str", b"bytes")))
I agree, but also think that's a good reason for people to switch to teaching novices pathlib rather than os.path, and letting them discover the underlying libraries as required by the code and examples they encounter.
Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
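The two categories can be observed directly. This sketch uses PurePosixPath and posixpath.join so that it runs identically on any platform; raises_type_error is a helper name invented for the example.

```python
import pathlib
import posixpath


def raises_type_error(func, *args):
    """Return True if calling func(*args) raises TypeError."""
    try:
        func(*args)
    except TypeError:
        return True
    return False


# First category: the high level API rejects bytes outright.
pathlib_rejects_bytes = raises_type_error(pathlib.PurePosixPath, b"some/path")

# Second category: all str is fine, all bytes is fine, a mix is not.
join_all_str = posixpath.join("a", "b")
join_all_bytes = posixpath.join(b"a", b"b")
join_rejects_mix = raises_type_error(posixpath.join, "a", b"b")

print(pathlib_rejects_bytes, join_all_str, join_all_bytes, join_rejects_mix)
```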

2016-04-14 15:40 GMT+02:00 Nick Coghlan <ncoghlan@gmail.com>:
I consider that the final goal of the whole discussion is to support something like:
path = os.path.join(pathlib_path, "str_path", direntry)
That's not a *new* problem though, it already exists if you pass in a mix of bytes and str: (...) There's also already a solution (regardless of whether you want bytes or str as the result), which is to explicitly coerce all the arguments to the same type:
os.path.join(*map(os.fsdecode, ("str", b"bytes"))) (...)
I don't understand. What is the point of adding a new __fspath__ protocol to *implicitly* convert path objects to strings, if you still have to use an explicit conversion? I would really expect that a high-level API like pathlib would solve encoding issues for me. IMHO DirEntry entries created by os.scandir(bytes) must use os.fsdecode() in their __fspath__ method. os.path.join() is just one example of an operation on multiple paths. Look at os.path for other examples ;-)
os.path.join(*map(os.fsdecode, ("str", b"bytes")))
This code is quite complex for a newbie, don't you think so? My example was os.path.join(pathlib_path, "str_path", direntry) where we can do something to make the API easier to use. I don't propose to do anything for os.path.join("str", b"bytes") which would continue to fail with TypeError, *as expected*. Victor

On 04/14/2016 06:56 AM, Victor Stinner wrote:
2016-04-14 15:40 GMT+02:00 Nick Coghlan:
Even earlier, Victor Stinner wrote:
I consider that the final goal of the whole discussion is to support something like:
path = os.path.join(pathlib_path, "str_path", direntry)
That's not a *new* problem though, it already exists if you pass in a mix of bytes and str: (...) There's also already a solution (regardless of whether you want bytes or str as the result), which is to explicitly coerce all the arguments to the same type:
--> os.path.join(*map(os.fsdecode, ("str", b"bytes"))) (...)
I don't understand. What is the point of adding a new __fspath__ protocol to *implicitly* convert path objects to strings, if you still have to use an explicit conversion?
That's the crux of the issue -- some of us think the job of __fspath__ is to simply retrieve the inherent data from the pathy object, *not* to do any implicit conversions.
I would really expect that a high-level API like pathlib would solve encodings issues for me. IMHO DirEntry entries created by os.scandir(bytes) must use os.fsdecode() in their __fspath__ method.
Then let pathlib do it. As a high-level interface I have no issue with pathlib converting DirEntry bytes objects to str using fsdecode (or whatever makes sense); os.path.join (and by extension os.fspath and __fspath__) should do no such thing.
os.path.join(*map(os.fsdecode, ("str", b"bytes")))
This code is quite complex for a newbie, don't you think so?
A newbie should be using pathlib. If pathlib is not low-level enough, then the newbie needs to learn about low-level stuff. -- ~Ethan~
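The two positions can be contrasted with a pair of hypothetical classes: RawEntry treats __fspath__ as pure retrieval, while DecodingEntry converts on the way out. Neither is a real stdlib class, and os.fspath is assumed to be available (Python 3.6 or later).

```python
import os


class RawEntry:
    """Hypothetical entry: __fspath__ only retrieves the stored value."""

    def __init__(self, name):
        self.name = name

    def __fspath__(self):
        return self.name


class DecodingEntry(RawEntry):
    """Hypothetical entry: __fspath__ always converts to str."""

    def __fspath__(self):
        return os.fsdecode(self.name)


raw_result = os.fspath(RawEntry(b"entry"))            # stays bytes
decoded_result = os.fspath(DecodingEntry(b"entry"))   # becomes str

print(type(raw_result).__name__, type(decoded_result).__name__)
```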

On 04/14/2016 05:16 AM, Victor Stinner wrote:
I consider that the final goal of the whole discussion is to support something like:
path = os.path.join(pathlib_path, "str_path", direntry)
Even if direntry uses a bytes filename. I expect genericpath.join() to be patched to use os.fspath(). If os.fspath() returns bytes, path.join() will fail with an annoying TypeError.
I expect that DirEntry.__fspath__ uses os.fsdecode() to return str, just to make my life easier.
This would be where we strongly disagree. If pathlib, as a high-level construct, wants to take that approach I have no issues, but the functions in os are low-level and as such should not be changing data types unless I ask for it. I see __fspath__ as a retrieval mechanism, not a data-transformation mechanism.
You can apply the same rationale for the flavors 2 and 3 (os.fspath(path, allow_bytes=True)). Indirectly, you will get similar TypeError on os.path.join().
And that's fine. Low-level interfaces should not change data types unless explicitly requested -- and we have fsencode() and fsdecode() for that. -- ~Ethan~
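For readers who have not opened the gist, the allow_bytes idea could look roughly like the sketch below. This is not Brett's code: the function body, the error messages and the decision to apply the flag to plain bytes arguments as well are all assumptions made for illustration, and Demo is a hypothetical class.

```python
def fspath(path, *, allow_bytes=False):
    """Hypothetical fspath() with an allow_bytes flag (illustrative only)."""
    if isinstance(path, str):
        return path
    if isinstance(path, bytes):
        if allow_bytes:
            return path
        raise TypeError("bytes path given but allow_bytes is False")
    try:
        method = type(path).__fspath__
    except AttributeError:
        raise TypeError(
            "expected str, bytes or an object with __fspath__, not "
            + type(path).__name__
        ) from None
    # Retrieve the value without converting it, then check its type.
    result = method(path)
    if isinstance(result, str) or (allow_bytes and isinstance(result, bytes)):
        return result
    raise TypeError("__fspath__ returned " + type(result).__name__)


class Demo:
    """Hypothetical path object that stores whatever it was given."""

    def __init__(self, value):
        self.value = value

    def __fspath__(self):
        return self.value


print(fspath(Demo("text")), fspath(Demo(b"raw"), allow_bytes=True))
```

Under this sketch no data type is ever changed; a bytes value is either passed through or rejected, and conversion remains the job of fsencode() and fsdecode().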

2016-04-14 16:54 GMT+02:00 Ethan Furman <ethan@stoneleaf.us>:
I consider that the final goal of the whole discussion is to support something like:
path = os.path.join(pathlib_path, "str_path", direntry)
(...) I expect that DirEntry.__fspath__ uses os.fsdecode() to return str, just to make my life easier.
This would be where we strongly disagree.
FYI it's ok that we disagree on this point, at least I expressed my opinion ;-) At least, we now identified better a point of disagreement. Victor

On 04/14/2016 09:09 AM, Victor Stinner wrote:
2016-04-14 16:54 GMT+02:00 Ethan Furman:
I consider that the final goal of the whole discussion is to support something like:
path = os.path.join(pathlib_path, "str_path", direntry)
(...) I expect that DirEntry.__fspath__ uses os.fsdecode() to return str, just to make my life easier.
This would be where we strongly disagree.
FYI it's ok that we disagree on this point, at least I expressed my opinion ;-)
Absolutely. I appreciate you explaining your point of view.
At least, we now identified better a point of disagreement.
Agreed. :) ~Ethan~
participants (19)
- Alexander Walters
- Antoine Pitrou
- Brett Cannon
- Chris Angelico
- Chris Barker
- Chris Barker - NOAA Federal
- Donald Stufft
- Ethan Furman
- Fred Drake
- Greg Ewing
- Koos Zevenhoven
- Michael Mysinger
- Nick Coghlan
- Nikolaus Rath
- Paul Moore
- Random832
- Stephen J. Turnbull
- Sven R. Kunze
- Victor Stinner