[Catalog-sig] PyPI improvements

Wed Jun 16 00:32:13 EDT 2004

On Jun 15, 2004, at 10:30 PM, Richard Jones wrote:
>> 1. Express relationships between packages.  These are relationships
>> like alternative-implementation, fork, part-of, recommends, requires,
>> etc.  At the moment I'm thinking purely about displaying this
>> information, not any fancy distutils magic installation of
>> dependencies.
>
> There's been a number of proposals and I believe some code towards
> implementing this kind of meta-data capture.
>
> The two extensions to distutils dealing with this issue that I know of
> are PIMP (/PackMan) and the ZPKG tools:
>
> http://undefined.org/python/pimp/
> http://www.python.org/packman/
>    (couldn't find a page giving the technical details of PIMP)
> http://zope.org/Members/fdrake/zpkgtools/
>    (this page has a good list of links to prior discussions /
>    proposals)
>
> Various proposals have also been made on this list. I have no idea how
> related those projects are. It would be a shame to develop *another*
> system.

I'm not entirely clear on all of these, but I think they are all
concerned with dependencies.  For that they need canonical
identifiers, which PyPI already has well enough (package names).  For
modules this wouldn't work, as module names are less likely to be unique.
Module identifiers would be an issue, but I don't think modules would
participate in automated dependencies quite so much.
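
A minimal sketch of how display-only relationships might be recorded,
keyed on the package name as the canonical identifier.  The class,
function, and field names are made up for illustration and are not part
of PyPI or distutils; the relationship kinds are the ones listed in the
quoted proposal.

```python
from collections import defaultdict
from dataclasses import dataclass

# Relationship kinds taken from the list in the quoted proposal.
KINDS = {"alternative-implementation", "fork", "part-of", "recommends", "requires"}


@dataclass(frozen=True)
class Relationship:
    source: str  # canonical package name, as registered in the index
    kind: str    # one of KINDS
    target: str  # canonical package name of the related package


def relationships_for_display(relationships, package):
    """Group the relationships of one package by kind, for display only."""
    grouped = defaultdict(list)
    for rel in relationships:
        if rel.kind not in KINDS:
            raise ValueError("unknown relationship kind: %r" % rel.kind)
        if rel.source == package:
            grouped[rel.kind].append(rel.target)
    return {kind: sorted(targets) for kind, targets in grouped.items()}
```

Nothing here resolves or installs anything; it only answers "what should
the page for this package list?".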

>> 2. Cache packages.  I.e., download a copy of the package, and if the
>> package disappears then we have a backup.
>
> The disappearance of packages is a concern. An archive network would
> solve this issue, but it requires both organisation and support from
> hosts. I'm pretty sure the current python.org machine is not suitable
> for storing packages.

It should mostly take disk space, at least as I'm envisioning it.  If
each package has a download URL (a real download URL, not just a
web page with other references) then we cache the archives and link to
the cached copy if we detect that the original archive is gone.
Packages without download locations won't be very popular (though they 
can still be interesting -- I've certainly found links to missing code 
that would interest me).
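
A rough sketch of the fallback described above, assuming the index
stores both the author's download URL and the location of a cached
copy.  The function names and the cache path layout are hypothetical;
the reachability check is passed in as a callable so that no particular
way of probing the network is implied.

```python
import hashlib
import posixpath
from urllib.parse import urlparse


def cache_key(name, version, download_url):
    """Build a stable path for the cached copy of one release archive."""
    filename = posixpath.basename(urlparse(download_url).path) or "archive"
    digest = hashlib.sha1(download_url.encode("utf-8")).hexdigest()[:8]
    return "%s/%s/%s-%s" % (name, version, digest, filename)


def link_to_show(download_url, cached_url, is_reachable):
    """Prefer the author's URL; fall back to the cache if it has gone away.

    is_reachable is a callable taking a URL and returning True or False,
    so the network check can be replaced in tests.
    """
    if download_url and is_reachable(download_url):
        return download_url
    return cached_url  # may be None if nothing was ever cached
```

A package with no download URL at all simply gets whatever is in the
cache, or no link.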

>> The other thing that might be useful is some improved categorization
>> of code.  The Trove categories are... well, they are stupid.  No fault
>> of anyone here.  CPAN's much more coarsely-grained categories are much
>> better, in my opinion (Acme, AI, Algorithm, Apache, AppConfig,
>> Archive, Array, and so on: http://www.cpan.org/modules/by-module)
>
> The current Trove list may be extended - I simply drew on the two
> best-known lists: sourceforge and freshmeat.
>
> What's the "Acme" category hold? :)

Joke modules, I believe.  Pythonistas apparently aren't as prone to 
humor.  So it goes.

I've found the trove categories to be overwhelming to use when creating
packages, and I've never paid attention to them when looking for
packages, in part because I can't expect authors to have defined
categories for their packages.

In Perl the categories are also caught up in naming, which I don't 
think we'd use.  And you can't belong to multiple categories, for the 
same reason.  But I think they present a simpler set of categories that 
would be more useful.  The Vaults has a reasonable set of categories as 
well.  We just need fewer categories.

>> But even more coarsely-grained than that, there are classes of
>> package.  Right now we have libraries and applications.
>
> PyPI doesn't make this distinction - though I believe it is a useful 
> one.
>
>
>> I'd like to add modules -- though the name is vague, I'm thinking of
>> code on the sophisticated end of the Python Cookbook entries.  Small,
>> reusable, and not worth distutilifying
>
> This sounds like a good idea, but raises a couple of issues:
>
> 1. Distutils isn't involved, but that's OK since PyPI allows TTW entry
>    of package meta-data.

I'd probably want to set up an automatic submission client that uses 
docstrings, but that's a separate issue.
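
As an illustration only, a docstring-driven client could start from
something like the following, which pulls a summary and description out
of a module's docstring using the standard ast module.  The function
name and the returned field names are assumptions, not an existing PyPI
interface, and the actual submission step is left out.

```python
import ast


def metadata_from_source(module_name, source):
    """Derive a minimal metadata record from a module's docstring."""
    docstring = ast.get_docstring(ast.parse(source)) or ""
    lines = docstring.splitlines()
    # First docstring line becomes the summary, the rest the description.
    summary = lines[0].strip() if lines else ""
    description = "\n".join(lines[1:]).strip()
    return {"name": module_name, "summary": summary, "description": description}
```

A module with no docstring yields empty summary and description fields,
which a real client would presumably refuse to submit.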

> 2. PyPI currently makes no assumptions about what the download_url
>    points to. Would you advocate using the download_url for locating
>    the module source?

Yes, or another field.  Freshmeat allows for a set of download URLs, 
which would potentially help this -- i.e., Windows installer, tarball, 
rpm or deb, etc.

> As I said in response to your weblog entry:
>
> "PyPI is intended to be an index of metadata that is generated by
> distutils. I'm not sure I'm comfortable extending that scope to include
> actual code fragments. It would confuse the meta-data schema and user
> interfaces considerably."

The idea of broad categories (application, library, module) may 
alleviate the UI issues.  We already have enough fragmentation -- even 
the Vaults get new submissions that don't go to PyPI -- so I'd hate to 
set up an entirely separate system.  It could be parallel, but that 
doesn't seem necessary.  Anyway, the prerequisite features are 
generally useful, so it's not a decision that has to happen yet.

>> When you're looking for code, each of these is quite different from
>> the others -- for any search, you will probably be interested in any
>> of these (a library to use, or a module or application to borrow from).
>
> Yep. And note that some entries will span two (or all?) categories -
> Roundup, for example, is both a library and an application.

Maybe that's what would be called a "framework".  But yes, it's a 
little vague.

>> Right now we're neither here nor there, as people don't think to add
>> applications to PyPI, and the trove categories are inappropriate for
>> libraries.
>
> I don't believe the categories as they stand are *that* useless!

Perhaps.  Some of them are a set of properties (e.g., development
status or natural language), which you probably wouldn't search on but
which are useful fields.  Others are a broad filter that would be
appropriate for separate interfaces (intended audience), or are largely
meaningless (at least for libraries, particularly OS and programming
language).  That leaves the topics, which aren't the best set of
categories, and I don't think I'd rely on them.

So far I have just searched on the description.  A full-text search of 
description, keywords, title, and classifiers would probably be my 
favorite search if available.  Unless I'm searching for a specific 
package, I would find searching on any single field (including 
category) to be too restrictive and too likely to cause me to miss 
something interesting.
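
A small sketch of that kind of search, assuming each package is a plain
dictionary of metadata.  The field names and function names are
illustrative, and plain substring matching stands in for whatever
indexing a real implementation would use.

```python
# "name" plays the role of the title; "summary" is included as well.
SEARCH_FIELDS = ("name", "summary", "description", "keywords", "classifiers")


def searchable_text(record):
    """Join every searchable field of a metadata record into one string."""
    parts = []
    for field in SEARCH_FIELDS:
        value = record.get(field) or ""
        if isinstance(value, (list, tuple)):
            value = " ".join(value)
        parts.append(value)
    return " ".join(parts).lower()


def search(records, query):
    """Return records in which every query word appears in some field."""
    words = query.lower().split()
    return [r for r in records if all(w in searchable_text(r) for w in words)]
```

Because all fields are searched together, a package whose author never
set a classifier can still be found through its description or keywords.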

>> On top of this is the infrastructure issue, which probably also has to
>> be dealt with before moving forward much (i.e., SQLite and CGI).
>> Concurrent updates to a SQLite database from multiple processes scare
>> the crap out of me.  But it doesn't look like that should be too hard
>> to fix.
>
> As I said in response to your weblog entry:
>
> "Finally, PyPI is bordering on being too large for the technologies
> it's built on; sqlite will need to be replaced by postgresql some time
> soon and the cgi.py-based web ui scales very poorly. Development such
> as you're proposing would push those technologies over the edge :)"
>
>
> On a separate topic, I believe it's pretty important that a document be
> written that captures your intentions. A lot of ideas have floated
> around on this list over the years - only to be subsequently forgotten
> because they're lost in the list archive. Yes, I'm suggesting writing a
> PEP about it. That way there's a single place someone can go to see the
> content and status of the proposal.

Sure, after a bit of back-and-forth here.  Maybe it would be easier to 
just write something up to be put in docs/ in CVS.

--
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org