[Distutils] Symlinks vs API -- question for developers

Fri Oct 17 22:49:05 CEST 2008

Toshio Kuratomi wrote:
> So I have a question for all the developers on this list.  Philip thinks
> that using symlinks will drive adoption better than an API to access
> package data.  I think an API will have better adoption than a symlink
> hack.  But the real question is what do people who maintain packages
> think?  Since Philip's given his reasoning, here's mine:
> 
> 1) Philip says that with symlinks distributions will likely have to
> submit patches to the build scripts to tag various files as belonging to
> certain categories.  If you, as an upstream are going to accept a patch
> to your build scripts to place files in a different place wouldn't you
> also accept a patch to your source code to use a well defined API to
> pull files from a different source?  This is a distribution's bread and
> butter and if there's a small, useful, well-liked, standard API for
> accessing data files you will start receiving patches from distributions
> that want to help you help them.

Annotating my files is extremely unlikely to break code, so I am more 
likely to accept a patch that does that.

> 2) Symlinks cannot be used universally.  Although it might not be common
> to want an FHS style install in such an environment, it isn't unheard
> of.  At one time in the distant past I had to use cygwin so I know that
> while this may be a corner case, it does exist.
> 
> 3) The primary argument for symlinks is that symlinks are compatible
> with __file__.  But this compatibility comes at a cost -- symlinks can't
> do anything extra.  In a different subthread Philip argues that
> setuptools provides more than distutils and that's why people switch and
> that the next generation tool needs to provide even more than
> setuptools.  Symlinks cannot do that.

As a library writer I have no motivation to do any of this.  New 
features do drive adoption more quickly than simple cleanup, but only 
features that would help me as a developer in some way (including making 
it easier to support users).  A new API wouldn't help me, and might hurt 
as it means more conventions to communicate to other developers.  Also 
I'd have to debug problems with the resource loading, which would be 
nothing but frustration.  I hate platform issues, and moving files 
around just means there are more platform issues I'd be exposed to.  Nothing 
platform-specific is of any interest to me as a developer -- 
unfortunately such problems come up often, but I don't want to go 
looking for new platform issues.

> 4) In contrast an API can do more:  It can deal with writable files. On
> Unix, persistent, per user storage would go in the user's home
> directory, on other OS's it would go somewhere else.  This is
> abstractable using an API at runtime but not using symlinks at install time.

Writable stuff is quite different, IMHO.  An API for writable files 
might be useful, but there are no current conventions around it, and I 
would expect that API to be entirely different from a resource API.
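
To make the difference concrete, here is a minimal sketch of what a 
runtime lookup for per-user writable storage might look like.  The name 
user_data_dir and its parameters are hypothetical, and the choice of 
APPDATA on Windows and a dot-directory elsewhere is an assumption for 
illustration, not an established convention:

```python
import os
import sys


def user_data_dir(appname, environ=None, platform=None):
    """Return a per-user directory for writable data (hypothetical helper).

    environ and platform are injectable only so the logic can be
    exercised without depending on the machine it runs on.
    """
    environ = os.environ if environ is None else environ
    platform = sys.platform if platform is None else platform

    if platform.startswith("win"):
        # Assumption: use %APPDATA% when set, else fall back to the home dir.
        base = environ.get("APPDATA") or os.path.expanduser("~")
        return os.path.join(base, appname)

    # Unix-like systems: a dot-directory in the user's home directory.
    return os.path.join(os.path.expanduser("~"), "." + appname)
```

Note that nothing here takes a package or a resource name; the decision 
is made per user at runtime, which is why I'd expect it to live apart 
from any read-only resource API.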

> 5) cross package data.  Using __file__ to detect file location is
> inherently not suitable for crossing package boundaries.  Egg
> Translations would not be able to use a symlink based backend to do its
> work for this reason.

You'll need to explain further, as I am unclear on the problem with 
__file__ in this context.  For instance, couldn't you symlink 
somepackage/translations/ to /usr/share/lang/somepackage ?
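
A rough sketch of what I have in mind, using a temporary directory as a 
stand-in for both the system location and site-packages.  The file name 
fr.po and the function name are made up for the example, and it assumes 
a platform where os.symlink works:

```python
import os
import tempfile


def demo_symlinked_translations():
    """Read a data file through a symlink using a __file__-style lookup."""
    with tempfile.TemporaryDirectory() as root:
        # Stand-in for /usr/share/lang/somepackage
        shared = os.path.join(root, "usr", "share", "lang", "somepackage")
        os.makedirs(shared)
        with open(os.path.join(shared, "fr.po"), "w") as f:
            f.write("bonjour")

        # Stand-in for the installed package directory
        pkg_dir = os.path.join(root, "site-packages", "somepackage")
        os.makedirs(pkg_dir)
        os.symlink(shared, os.path.join(pkg_dir, "translations"))

        # What the package itself would compute from its own __file__
        fake_file = os.path.join(pkg_dir, "__init__.py")
        path = os.path.join(os.path.dirname(fake_file), "translations", "fr.po")
        with open(path) as f:
            return f.read()
```

The package code never learns that translations/ lives elsewhere; the 
open() call simply follows the link.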

> 6) zipped eggs.  These require an API.  So moving to symlinks is
> actually a regression.

True.

> 7) Philip says that the reason pkg_resources does not see widespread
> adoption is that the developer cost of using an API is too high compared
> to __file__.  I don't believe that the difference between file and API
> is that great.  An example of using an API could be something like this:
> 
> Symlinks::
>   import os
>   icondirectory = os.path.join(os.path.dirname(__file__), 'icons')
> 
> API::
>   import pkgdata
>   icondirectory = pkgdata.resource(pkg='setuptools', \
>       category='icon', resource='setuptools.png')
> 
> Instead I think the data handling portion of pkg_resources is not more
> widely adopted for these reasons:

Just personally, it's entirely laziness on my part; I can't remember the 
signatures for the resource stuff, so I write what I most immediately 
remember.  I think the Distro/package ambiguity also confuses me.

> * pkg_resources's package handling is painful for the not-infrequent
> corner cases.  So people who have encountered the problems with
> require() not overriding a default or not selecting the proper version
> when multiple packages specify overlapping version ranges already have a
> negative impression of the library before they even get to the data
> handling portion.
> 
> * pkg_resources does too much: loading libraries by version really has
> nothing to do with loading data for use by a library.  This is a
> drawback because people think of and promote pkg_resources as a way to
> enable easy_install rather than a way to enable abstraction of data
> location.
> 
> * The only benefit (at least, being promoted in the documentation) is to
> allow zipped eggs to work.  Distributions have no reason to create
> zipped eggs so they have no reason to submit patches to upstream to
> support the pkg_resources api.
> 
> * Distributions, further, don't want to install all-in-one egg
> directories on the system.  The pkg_resources API just gets in the way
> of doing things correctly in a distribution.  I've had to patch code to
> not use pkg_resources if data is installed in the FHS mandated areas.
> Far from encouraging distributions to send patches upstream to make
> modules use pkg_resources this makes distributions actively discourage
> upstreams from using it.
> 
> * The API isn't flexible enough.  EggTranslations places its data within
> the metadata store of eggs instead of within the data store.  This is
> because the metadata is able to be read outside of the package in which
> it is included while the package data can only be accessed from within
> the package.
> 
> 
> 8) To a distribution, symlinks are just a hack.  We use them for things
> like php web apps when the web application is hardcoded to accept only
> one path for things (like the writable state files being intermixed with
> the program code).  Managing a symlink farm is not something
> distributions are going to get excited over so adoption by distributions
> that this is the way to work with files won't happen until upstreams
> move on their own.
> 
> Further, since the install tool is being proposed as a separate project
> from the metadata to mark files, the expectation is that the
> distributions are going to want to write an install tool that manages
> this symlink farm.  For that to happen, you have to get distributions to
> be much more than simply neutral about the idea of symlinks, you have to
> have them enthused enough about using symlinks that they are willing to
> spend time writing a tool to do it.
> 
> 
> So once again, I think this boils down to these questions: if we have a
> small library whose sole purpose is to abstract a data store so you can
> find out where a particular non-code file lives on this system will you
> use it?  

Realistically, no.

> If a distribution packager sends you a patch so the data files
> are marked correctly and the code can retrieve their location instead of
> hardcoding an offset against __file__ will you commit it?

If it adds a dependency and an abstraction that isn't obvious, then no, 
I would not commit it.  Just marking the files is fine, because it has 
no impact on other code.

-- 
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org