[Distutils] Symlinks vs API -- question for developers

Mon Oct 20 01:28:01 CEST 2008

At 10:32 AM 10/17/2008 -0700, Toshio Kuratomi wrote:
>So I have a question for all the developers on this list.  Philip thinks
>that using symlinks will drive adoption better than an API to access
>package data.  I think an API will have better adoption than a symlink
>hack.  But the real question is what do people who maintain packages
>think?  Since Philip's given his reasoning, here's mine:
>
>1) Philip says that with symlinks distributions will likely have to
>submit patches to the build scripts to tag various files as belonging to
>certain categories.  If you, as an upstream, are going to accept a patch
>to your build scripts to place files in a different place wouldn't you
>also accept a patch to your source code to use a well defined API to
>pull files from a different source?  This is a distribution's bread and
>butter and if there's a small, useful, well-liked, standard API for
>accessing data files you will start receiving patches from distributions
>that want to help you help them.

I'll leave this to the developers, but please note that the real 
historical answer to this question is "no", or at least "not in the 
current release".  Keep in mind that most "yeses" you get to this 
question will really mean, "when I can get around to understanding 
the API and testing it and have time to put it in a new release" -- 
while the "yeses" for adding spec metadata are more likely to mean, 
"yes, I'll check it in right now if it looks correct".

>2) Symlinks cannot be used universally.  Although it might not be common
>to want an FHS style install in such an environment, it isn't unheard
>of.  At one time in the distant past I had to use cygwin so I know that
>while this may be a corner case, it does exist.

Cygwin does symlinks, actually.

>3) The primary argument for symlinks is that symlinks are compatible
>with __file__.  But this compatibility comes at a cost -- symlinks can't
>do anything extra.  In a different subthread Philip argues that
>setuptools provides more than distutils and that's why people switch and
>that the next generation tool needs to provide even more than
>setuptools.  Symlinks cannot do that.

I think Ian's already said this, but the API itself has to do 
something more, and so far nobody's proposed an API that does 
anything "more" than what setuptools does in this area, from the 
developer point of view.  (Except for the request that such an API be 
in the stdlib and thus avoid an extra dependency...  but that of 
course introduces yet another implementation delay, if it means a new 
release of Python.)

>4) In contrast an API can do more:  It can deal with writable files. On
>Unix, persistent, per user storage would go in the user's home
>directory, on other OSes it would go somewhere else.  This is
>abstractable using an API at runtime but not using symlinks at install time.

This is all well and good, but it's actually quite orthogonal to most 
uses of __file__ and resources today.
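
For concreteness, the kind of runtime abstraction point 4 describes might look like the sketch below.  This is only an illustration: user_data_dir() is a made-up helper, not part of setuptools or any proposed API, and the directory conventions it picks (a dot-directory on Unix, %APPDATA% on Windows) are assumptions.

```python
import ntpath
import os
import posixpath


def user_data_dir(app_name, platform=os.name, environ=None, home=None):
    """Return a per-user writable directory for app_name (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    home = os.path.expanduser("~") if home is None else home
    if platform == "nt":
        # Assumed Windows convention: prefer %APPDATA%, else the home directory.
        base = environ.get("APPDATA", home)
        return ntpath.join(base, app_name)
    # Assumed Unix convention: a dot-directory under the user's home.
    return posixpath.join(home, "." + app_name)


# The decision is made at runtime, which an install-time symlink cannot do.
unix_dir = user_data_dir("myapp", platform="posix", environ={}, home="/home/me")
```

Note that nothing here touches __file__ or read-only package resources, which is why this is a separate concern from the transition question.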

>5) cross package data.  Using __file__ to detect file location is
>inherently not suitable for crossing package boundaries.  EggTranslations
>would not be able to use a symlink-based backend to do its
>work for this reason.

EggTranslations doesn't use __file__, it uses the API, so I don't see 
how this relates.

>6) zipped eggs.  These require an API.  So moving to symlinks is
>actually a regression.

As I mentioned earlier, setuptools marks eggs that use __file__ as 
needing to be installed unzipped, so it's not a regression; it's 
simply providing the same level of compatibility that setuptools does.

What *would* be a regression, wrt developer-side features, is 
requiring the use of an API.
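
To illustrate why code that uses __file__ has to be installed unzipped, here is a small self-contained sketch.  It uses the stdlib's pkgutil.get_data() as a stand-in for a resource API (it is not the pkg_resources API under discussion), and the package and file names are invented for the example.

```python
import importlib
import os
import pkgutil
import sys
import tempfile
import zipfile

# Build a throwaway zip archive holding a package and one data file.
# "zpkg_demo" and "greeting.txt" are made-up names for this illustration.
tmpdir = tempfile.mkdtemp()
archive = os.path.join(tmpdir, "bundle.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("zpkg_demo/__init__.py", "")
    zf.writestr("zpkg_demo/greeting.txt", "hello from the archive\n")

sys.path.insert(0, archive)
importlib.invalidate_caches()
zpkg_demo = importlib.import_module("zpkg_demo")

# The __file__ idiom yields a path that points *inside* the zip file,
# which the filesystem cannot open directly.
via_file = os.path.join(os.path.dirname(zpkg_demo.__file__), "greeting.txt")
file_idiom_works = os.path.exists(via_file)

# An API that goes through the package's loader can still read the bytes.
via_api = pkgutil.get_data("zpkg_demo", "greeting.txt")

sys.path.remove(archive)
```

Here file_idiom_works comes out false while via_api holds the file's contents, which is the compatibility gap that unzipped installation (or a symlink) papers over.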

>7) Philip says that the reason pkg_resources does not see widespread
>adoption is that the developer cost of using an API is too high compared
>to __file__.  I don't believe that the difference between __file__ and API
>is that great.

It isn't; it's the *switching* cost that's high, and that's the cost 
that needs to be minimized in order to drive adoption quickly.

>[snip]

I'll just note that the bullets I'm skipping are mostly irrelevant to 
the issue at hand: i.e., switching cost of using *any* API, AND 
switching cost for the developers who *are* using pkg_resources 
presently.  Let's not forget that second group of people, because the 
fact they are using the API shows they are likely early 
adopters.  Make it too hard for them to switch, and you might not 
have any early adopters left for the new thing.  ;-)

>* The API isn't flexible enough.  EggTranslations places its data within
>the metadata store of eggs instead of within the data store.  This is
>because the metadata is able to be read outside of the package in which
>it is included while the package data can only be accessed from within
>the package.

Actually, this is incorrect.  EggTranslations' use of project-level 
data is so that it's not necessary to include a Python module in the 
egg, just to have a place to put the data.  Access from other 
packages hasn't got anything to do with it.

>8) To a distribution, symlinks are just a hack.  We use them for things
>like php web apps when the web application is hardcoded to accept only
>one path for things (like the writable state files being intermixed with
>the program code).  Managing a symlink farm is not something
>distributions are going to get excited over, so distributions won't
>adopt this as the way to work with files until upstreams move on
>their own.

We need to distinguish between "providing the ability to have a 
low-cost transition" and "the recommended True Way".

IOW, symlinks and an API are not mutually exclusive; I'm just 
pointing out that if an API is required, the transition of packages 
to the new standard will occur *only as quickly as the slowest 
upstream dependency*.

If the developer of A depends on B, and B hasn't transitioned yet, 
then A can't transition.
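
The "slowest upstream dependency" effect can be sketched as a small calculation.  Everything here is hypothetical: the package names, the dependency map, and the "months until the maintainer gets around to it" figures are invented purely to show the shape of the argument.

```python
def transition_time(package, own_ready, depends_on, _seen=frozenset()):
    """Earliest time `package` can switch if all its dependencies must switch first."""
    if package in _seen:
        raise ValueError("dependency cycle at %r" % (package,))
    seen = _seen | {package}
    upstream = [
        transition_time(dep, own_ready, depends_on, seen)
        for dep in depends_on.get(package, ())
    ]
    # A package is gated by its own readiness and by its slowest dependency.
    return max([own_ready[package]] + upstream)


# Invented figures: months until each maintainer would adopt the API.
own_ready = {"A": 1, "B": 6, "C": 2}
depends_on = {"A": ["B"]}

a_switches_at = transition_time("A", own_ready, depends_on)
c_switches_at = transition_time("C", own_ready, depends_on)
```

With these made-up numbers, A is ready after one month but cannot actually switch until month six, because it has to wait for B.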

>Further, since the install tool is being proposed as a separate project
>from the metadata to mark files, the expectation is that the
>distributions are going to want to write an install tool that manages
>this symlink farm.  For that to happen, you have to get distributions to
>be much more than simply neutral about the idea of symlinks, you have to
>have them enthused enough about using symlinks that they are willing to
>spend time writing a tool to do it.

Well, the question is whether they prefer to have a long, drawn-out 
transition or not.  Maybe they don't care about that part, but my 
assumption was that a replacement for setuptools/easy_install in this 
space was desired sooner rather than later.

If that's the case, then making it possible for packages to 
transition without changing their runtime code is a must-have.
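
As a rough sketch of what "without changing their runtime code" means in practice: below, the data file lives in an FHS-style directory and a symlink sits where the package expects it, so the usual __file__ idiom keeps working untouched.  The directory layout and names are assumptions made up for the example, and it presumes a platform where os.symlink() is permitted.

```python
import importlib
import os
import sys
import tempfile

root = tempfile.mkdtemp()
site_dir = os.path.join(root, "site-packages")
pkg_dir = os.path.join(site_dir, "symdemo_pkg")
share_dir = os.path.join(root, "share", "symdemo_pkg")
os.makedirs(pkg_dir)
os.makedirs(share_dir)

# The package's runtime code uses the common __file__ idiom, unchanged.
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write(
        "import os\n"
        "def read_template():\n"
        "    here = os.path.dirname(__file__)\n"
        "    with open(os.path.join(here, 'template.txt')) as fh:\n"
        "        return fh.read()\n"
    )

# A (hypothetical) installer puts the data file in an FHS-style location...
real_path = os.path.join(share_dir, "template.txt")
with open(real_path, "w") as f:
    f.write("Hello, $name\n")

# ...and leaves a symlink where the code expects to find it.
os.symlink(real_path, os.path.join(pkg_dir, "template.txt"))

sys.path.insert(0, site_dir)
importlib.invalidate_caches()
symdemo_pkg = importlib.import_module("symdemo_pkg")
content = symdemo_pkg.read_template()
sys.path.remove(site_dir)
```

Only the install layout changed; the package source is exactly what the developer already ships.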

>So once again, I think this boils down to these questions: if we have a
>small library whose sole purpose is to abstract a data store so you can
>find out where a particular non-code file lives on this system will you
>use it?  If a distribution packager sends you a patch so the data files
>are marked correctly and the code can retrieve their location instead of
>hardcoding an offset against __file__ will you commit it?

I think the answer to both questions is "yes... eventually...  if the 
API is in the stdlib for all Python versions I'm targeting and 
everybody else is doing it."  Which is why *requiring* it for 
transition will prevent the distros from seeing benefits from a new 
standard for quite some time.
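
For reference, the sort of "small library" being asked about could be as little as the sketch below.  To be clear, locate_data() and the INSTALLED_LOCATIONS index are hypothetical names invented here, not an existing or proposed API; a real tool would presumably read the index from installation metadata.  The fallback branch is what lets unconverted layouts keep working.

```python
import importlib
import json
import os

# Hypothetical install-time index: (package, resource) -> installed location.
INSTALLED_LOCATIONS = {}


def locate_data(package, resource, index=INSTALLED_LOCATIONS):
    """Return the path of a package's data file (illustrative only)."""
    try:
        return index[(package, resource)]
    except KeyError:
        # Fall back to the layout today's code assumes: next to the module.
        module = importlib.import_module(package)
        return os.path.join(os.path.dirname(module.__file__), resource)


# Without an index entry, the answer matches the __file__ idiom.
default_path = locate_data("json", "example.dat")

# With an entry (as a distribution might record), the file can live elsewhere.
relocated = locate_data(
    "json",
    "example.dat",
    {("json", "example.dat"): "/usr/share/example/example.dat"},
)
```

The stdlib json package is used only as a convenient importable package; "example.dat" need not exist for the path computation.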

Conversely, if the patch for installation metadata is separated from 
patches to code, I would expect a *much* faster uptake of the 
metadata patches.  And, once having accepted the metadata patch, a 
developer is actually more likely to take the second step willingly, 
than if required to do both at once.  (See "Influence" by Cialdini.)

To be 100% clear (I hope): I have no objection to an API.  It's 
unequivocally a good idea, and *should* be part of 
BUILDS.  *Requiring* it, on the other hand, is unequivocally a *bad* 
idea, if you want adoption sooner rather than later.

Now, if you want to establish a timetable for deprecating and phasing 
out __file__ usage, based on when the API will be available in the 
stdlib, and then publicize and bless that schedule...  again, these 
are all good ideas.

The ONLY thing I object to is requiring it up front from day 1, 
because then we're just shooting off a giant foot-gun wrt adoption.