[Python-Dev] pathlib - current status of discussions

Tue Apr 12 04:31:21 EDT 2016

On 12 April 2016 at 06:28, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Donald Stufft writes:
>
>  > I think yes and yes [__fspath__ and fspath should be allowed to
>  > handle bytes, otherwise] it seems like making it needlessly harder
>  > to deal with a bytes path
>
> It's not needless.  This kind of polymorphism makes it hard to review
> code locally.  Once bytes get a foothold inside a text application,
> they metastasize altogether too easily, and you end up with TypeErrors
> or UnicodeErrors quite far from the origin.  Debugging often requires
> tracing data flows over hill and over dale while choking from the
> dusty trail, or band-aids like a top-level "except UnicodeError:
> log_and_quarantine(bytes)".  I can't prove that returning bytes from
> these APIs is a big risk in this sense, but I can't see a way to prove
> that it's not, either, given that their point is duck-typing, and
> therefore they may be generalized in the future, and by third parties.
>
> I understand that there are applications where it's bytes all the way
> down, but by the very nature of computing systems, there are systems
> where bytes are decoded to text.  For historical reasons (the encoding
> Tower of Babel), it's very error-prone to do that on demand.  Best
> practice is to do the conversion as close to the boundary as possible,
> and process only text internally.
>
> In text applications, "bytes as carcinogen" is an apt metaphor.
>
> Now, I'm not Dutch, so I can't tell you it's obvious that the risk to
> text-processing applications is more important than the inconvenience
> to byte-shoveling applications.  But there is a need to be
> parsimonious with polymorphism.

As someone who has done a lot of work helping projects to port from
the 2.x bytes/text model to the 3.x model, I have similar concerns
that rooting out the source of bytes objects appearing in a program
could be an issue with the proposed "return either" approach. The most
effective tool I have found in fixing programs with text/bytes issues
is carefully and thoroughly annotating precisely which functions
accept and return bytes, and which accept and return text. The sort of
mixed-mode processing we're talking about here makes that
substantially harder. And note that the signature of os.fspath can
return bytes or text *independent* of the type of the argument - it's
not a "bytes in, bytes out" function like the usual pattern of
"polymorphic support for bytes".

But just like Stephen, I have no feel for how significant the risk
will be in real life. I've never worked on code that actually has a
need for bytestring paths (particularly now that surrogateescape
ensures that most cases "just work").

Paul