On 6 April 2016 at 17:04, Koos Zevenhoven wrote:
In the "Working with Path objects: p-strings?" thread, I said I was working on a proposal. Since it's been several days already, I think I should post it here and get some feedback before going any further. Maybe I should have done that even earlier. Anyway, there are some rough edges, and I will need to add links to references etc.
Thanks for putting this together. I don't agree with much of it, but it's good to have the proposal stated so clearly.
So, do not hesitate to give feedback or criticism, which is especially appreciated if you take the time to read through the whole thing first :).
While I've read the whole proposal, there's a lot to digest, and honestly I don't have the time to spend on this right now - so my apologies if I missed anything relevant. Hopefully my comments will make sense anyway :-)
Filesystem paths are strings that give instructions for traversing a directory tree. In Python, they have traditionally been represented as byte strings, and more recently, Unicode strings. However, Python now has ``pathlib`` in the standard library, which is an object-oriented library for dealing with objects specialized in representing a path and working with it. In this proposal, such objects are generally referred to as *path objects*, or sometimes, in the specific context of instances of the ``pathlib`` path classes, they are explicitly referred to as ``pathlib`` objects.
I'm not sure I agree with this. To me, "filesystem paths" are things which define the location of a file in a filesystem. They are not strings, even though they can be represented by strings (actually, they can't, technically - POSIX allows nearly arbitrary bytestrings for paths, whereas Python strings are Unicode). Saying a path is a string is no more true than saying that integers are strings that represent whole numbers. Traditionally, people haven't thought of paths as objects because not many languages provide *any* sort of abstraction of paths - doing so in a cross-platform way is *hard* and most languages duck the issue. Python is exceptional in providing good path manipulation functions (even os.path is streets ahead of what many other languages offer).
Filesystem paths (or comparable things like URIs) are strings of characters that represent information needed to access a file or directory (or other resource). In other words, they form a subset of strings, involving specialized functionality such as joining absolute and relative paths together, accessing different parts of the path or file name, and even accessing the resources the path points to. In Python terms, for a path ``path``, one would have ``isinstance(path, str)``. It is also clear that not all strings are paths.
As noted above, this makes no sense to me. By this argument "integers are strings of characters that represent numbers". The string representation of an object is *not* the object.
On the one hand, this would make an ideal case for making all path-representing objects inherit from ``str``; while Python tries not to over-emphasize object-oriented programming and inheritance, it should not try to avoid class hierarchies when they are appropriate in terms of both purity and practicality. Regarding practicality, making specialized *path objects* also instances of ``str`` would make almost any stdlib or third-party function accept path objects as path arguments, assuming that they accept any instance of ``str``. Furthermore, functions now returning instances of ``str`` to represent paths could in future versions return path objects, with only minor backwards-incompatibility worries.
You mention both practicality and purity here but only offer "practical" arguments. The practical arguments are fair, and as far as I can see are the crux of any proposal to make Path objects subclass str. You should focus on this, and not try to argue that subclassing str is "right" in any purity sense.
On the other hand, strings are a very general concept, and the Python ``str`` class provides a large variety of methods to manipulate and work with them, including ``.split()``, ``.find()``, ``.isnumeric()`` and ``.join()``. These operations are defined just as well for a string that represents a path as for any other string. In fact, this is the status quo in Python, as the adoption of ``pathlib`` is still quite limited and paths are in most cases represented as strings (sometimes byte strings). But while the string operations are *defined* on path-representing strings, the results of these operations are of no use in most cases, even if they occasionally are.
This seems to me to be a key point - if many of the operations that are part of the interface of a string don't make sense for a filesystem path, doesn't that very clearly make the point that filesystem paths are *not* strings?
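To make that concrete, here is a small sketch (the file name is made up purely for illustration). It applies a few str methods to a path held as text and then asks the same sort of questions of a pathlib.PurePosixPath:

```python
from pathlib import PurePosixPath

# Made-up example path, used only for illustration.
name = "/srv/data/report.final.txt"

# str methods are defined on the path text, but the results rarely mean much.
titled = name.title()        # capitalises every "word" in the path
pieces = name.split("/")     # starts with an empty string for the root
numeric = name.isnumeric()   # a question nobody asks about a path

# The path abstraction answers the questions one actually has about a path.
path = PurePosixPath(name)
parts = path.parts
suffix = path.suffix
parent = path.parent
```

None of the str calls fail, which is rather the point: they succeed and hand back values that have little to do with the file's location.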
There is prior art in subclassing the Python ``str`` type to build a path object type. Packages on PyPI (TODO: list more?) that do this
pylib's py.path.local object (used in pytest in particular) is another.
include ``path.py`` and ``antipathy``. The latter also supports ``bytes``-based paths by instantiating a different class, a subclass of ``bytes``. Since these libraries have existed for several years, experience from them is available for evaluating the potential benefits and weaknesses of this proposal (as well as other aspects regarding ``pathlib``).
I don't think there's been any attempt made to collect or quantify that experience, though. All I've ever seen is hearsay "I've not heard of anyone reporting problems" evidence. While anecdotal evidence is a lot better than nothing, it's of limited value. Apart from anything else, there's a self-selection issue - people who *did* have problems may simply have stopped using the libraries.
Overriding all ``str``-specific methods ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Since most of the ``str`` methods are not of any use on paths and can be confusing, leading to undesired behavior, *most* ``str`` methods (including magic methods, but excluding methods listed below) are overridden in ``PurePath`` with methods that by default raise ``TypeError("str method '<name>' is not available for paths.")``. This will help programmers to immediately notice when they are using the wrong method. The perhaps unusual practice of disabling most base-class methods can be regarded as being conservative in adding ``str`` functionality to path objects.
This seems to me to be the biggest issue. You're proposing that Path objects will subclass strings, but code written to expect a string may fail if passed a Path object. Presumably though that code works if passed str(the_path_object) - as it works correctly right now. Maybe it's doing "string-like" things, but equally, it's presumably intended to. Consider a "make path uppercase" function that simply does .upper() on its argument. You are proposing a class that is a subclass of str, but calling str() on an instance gives an object that behaves differently. That's bizarre at best, and realistically I'd describe it as fundamentally broken. I don't want to argue type-theory here, but I'm pretty sure that violates most people's intuition of what inheritance means.
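A minimal sketch of the problem, assuming a hypothetical StrPath class that disables just one str method in the way the proposal describes (StrPath and make_upper are names invented for this illustration, not part of pathlib or of the proposal):

```python
class StrPath(str):
    """Hypothetical str-subclassing path with one str method disabled."""

    def upper(self):
        raise TypeError("str method 'upper' is not available for paths.")


def make_upper(name):
    # Existing code written against plain strings.
    return name.upper()


p = StrPath("/tmp/readme.txt")
plain = str(p)  # a plain str, even though p is already a str instance

upper_from_plain = make_upper(plain)

try:
    make_upper(p)
except TypeError as exc:
    error = str(exc)
else:
    error = None
```

Here isinstance(p, str) is true, yet make_upper(p) raises while make_upper(str(p)) works. Code that checks the type before calling string methods gets no protection from that check.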
Optional enabling of string methods ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Since many APIs currently have functions or methods that return paths as strings, existing code may expect to have all string functionality available on the returned objects. While most users are unlikely to use much of the ``str`` functionality, a library function may want to explicitly allow these operations on a path object that it returns. Therefore, the overridden ``str`` methods can be enabled by setting the ``._enable_str_functionality`` attribute on a path object as follows:
- ``pathobj._enable_str_functionality = True #`` -- Enable ``str`` methods
- ``pathobj._enable_str_functionality = 'warn' #`` -- Enable ``str`` methods, but emit a ``FutureWarning`` with the message ``"str method '<name>' may be disabled on paths in future versions."``
This is a huge chunk of extra complexity, both in terms of implementation, and even more so in terms of understanding. If someone wants a "real" string, they can just cast using str() or use the .path attribute. This whole section of the proposal says to me that you haven't actually solved the problem you're trying to solve - you still expect people to have problems passing Path objects to functions that aren't expecting them, and you've had to consider how to work round that. The fact that you came up with (in effect) a "configuration flag" on an immutable object like a Path rather than just using the existing "give me a real string" options on Path, implies that your proposal is not well thought through in this area.

Here are some questions for you (but IMO this section is unfixable - no matter what answers you give, I still consider this whole mechanism a non-starter).

* Are Path objects hashable, given they now have a mutable attribute?
* If you change the _enable_str_functionality flag, does the object's hash change?
* If it doesn't, what happens when you add 2 identical paths with different _enable_str_functionality flags to a set?
* If you enable str methods do they return str or Path objects? If the latter, what is the flag set to on these objects?

Basically, you broke a fundamental property of both Path and string objects - they are immutable.
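The set question can be sketched as follows. FlaggedPath is a hypothetical stand-in that assumes the flag is an ordinary instance attribute and that equality and hashing are inherited unchanged from str. The proposal does not say either way, so treat this as one possible reading rather than the proposed behaviour:

```python
class FlaggedPath(str):
    """Hypothetical str subclass carrying a mutable per-instance flag."""

    _enable_str_functionality = False


a = FlaggedPath("/tmp/data")
b = FlaggedPath("/tmp/data")
b._enable_str_functionality = True  # allowed: str subclasses get a __dict__

# Equality and hash come from str, so the flag is invisible to the set.
paths = set()
paths.add(a)
paths.add(b)  # b compares equal to a, so the set keeps a and drops b

survivor = next(iter(paths))
```

Under these assumptions the set ends up holding one object, and which flag value survives depends only on insertion order.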
Changes needed to other stdlib modules ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In stdlib modules other than ``pathlib``, mainly ``os``, ``ntpath`` and ``posixpath``, the functions that use the methods/functionality listed below on path or file names will be modified to explicitly convert the name ``name`` to a plain string first, e.g., using ``getattr(name, 'path', name)``, which also works for ``DirEntry`` but may return ``bytes``:
- ``split``
- ``find``
- ``rfind``
- ``partition``
- ``__iter__``
- ``__getitem__``
This can be done with the current Path objects (and should). It is unrelated to this proposal. And it doesn't need to be restricted to "if overridden string functions are used". Just do it regardless, and all existing functions work immediately. The only issue is functions that *return* paths. And they are no harder under current Pathlib than under your proposal - a decision on what type to return has to be made either way.
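As a rough sketch of the "just do it regardless" approach: PathLike below is a hypothetical stand-in for any path object that exposes a .path attribute, and as_plain_path and split_ext are helper names made up for this example. os.DirEntry is real and already has .path:

```python
import os
import tempfile


class PathLike:
    """Hypothetical path object exposing its text via a .path attribute."""

    def __init__(self, path):
        self.path = path


def as_plain_path(name):
    # Plain str/bytes have no .path attribute, so they pass through unchanged.
    return getattr(name, "path", name)


def split_ext(name):
    # Convert once at the boundary, then use ordinary string-based helpers.
    return os.path.splitext(as_plain_path(name))


from_str = split_ext("a/b.txt")
from_obj = split_ext(PathLike("a/b.txt"))

# The same conversion works for os.DirEntry objects returned by os.scandir().
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "x.txt"), "w").close()
    with os.scandir(tmp) as entries:
        entry = next(entries)
        from_entry = as_plain_path(entry)
        entry_path = entry.path
```

The conversion happens once on the way in, and nothing after that point needs to know what kind of object was passed.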
Guidelines for third-party package maintainers
----------------------------------------------
Libraries that take paths as arguments or return them
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Since all of the standard library will accept path objects as path arguments, most third-party libraries will automatically do so. However, those that directly manipulate or examine the path name using ``str`` methods may not work. Those libraries will not immediately be ``pathlib``-compatible.
Overcomplicated. If you accept paths, just do getattr(patharg, 'path', patharg) and you're fine. If you return paths, do nothing (or if you prefer, think about your API and make a more considered decision). Your proposal means that library authors have to actually consider whether the new path objects will cause subtle failures, because the string-like objects will not fail quickly, leading to bugs propagating into unrelated code.

Overall, I'm a strong -1. If we subclass str, we should just do it and not over-complicate like this. I'm still not convinced we should do so, but your proposal *has* convinced me that any attempt to compromise is going to end up being worse than either option.

Sorry I can't be more positive - but again, thanks for the thorough write-up.

Paul