[Catalog-sig] PyPI improvements

Wed Jun 16 12:31:05 EDT 2004

Richard Jones wrote:
> On Wednesday 16 Jun 2004 14:32, Ian Bicking wrote:
>>For modules this wouldn't work, as the naming would be less unique.
>>Module identifiers would be an issue, but I don't think they'd
>>participate in automated dependencies quite so much.
> 
> 
> If you're going to have some meta-data embedded in the module, then one of 
> those fields can be a name in the PyPI namespace.
> 
> I think that if the modules are going to be in PyPI, then they've got to have 
> a unique name. Names are keys in PyPI (just as they are in CPAN / PAUSE).

I think they'd have to be parameterized in some way then -- stand-alone 
modules just aren't likely to be uniquely named.  Or, to make them 
uniquely named would lead to funky names (e.g., joe_screenscraper.py). 
Maybe it could be a name of author_username:module_name, or something 
like that.  Or maybe the names simply don't have to match the Python 
module name.
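
A minimal sketch of the author_username:module_name idea, assuming a
single colon separates the two parts and that a bare name (no colon)
is still allowed.  The names QualifiedName and parse_name are
hypothetical; nothing like this exists in PyPI.

```python
from typing import NamedTuple, Optional


class QualifiedName(NamedTuple):
    """A PyPI-style key split into an optional author and a module name."""
    author: Optional[str]
    module: str


def parse_name(name: str) -> QualifiedName:
    """Split 'author:module' into its parts; a bare 'module' has no author."""
    author, sep, module = name.partition(":")
    if not sep:
        # No colon at all: treat the whole string as the module name.
        return QualifiedName(None, name)
    if not author or not module:
        raise ValueError("expected author_username:module_name, got %r" % name)
    return QualifiedName(author, module)


print(parse_name("joe:screenscraper"))
```

With this scheme two authors could each publish a screenscraper.py,
and the file on disk would not need a funky name.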

I really *don't* want to encourage a lot of distutiled modules with 
conflicting names.  In the comments on my post 
(http://blog.colorstudy.com/system/comments.py?u=0000001&p=P123) someone 
suggested automatically creating a zip/tarball with the proper setup.py, 
and I think that would be a bad idea and would lead to a polluted 
site-packages.

>>It should mostly take disk space, at least how I'm envisioning it.
> 
> Then current python.org (creosote) is definitely not up to the task.

Okay... then maybe it should be an RPC setup.  Like, a client can query 
for a list of URLs that have not been archived (sufficiently), and can 
register the fact it has archived a URL.  Then there'd be the 
infrastructure so that archiving can be offloaded to another machine, 
though no archiving or mirroring would be built into the system.  There 
shouldn't be any lack of machines for the use -- disk and bandwidth are 
so cheap these days that a complex system like CPAN's seems like 
overkill.  I'm guessing creosote is just kind of old, and 
(understandably) no one wants to deal with the sysadmin issues of upgrading.
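
The two calls described above could look roughly like this.  It is an
in-memory stand-in only: the class name ArchiveRegistry, the method
names, and the min_copies threshold for "sufficiently" archived are
all assumptions, and the RPC transport is left out.

```python
class ArchiveRegistry:
    """Tracks which download URLs have been archived, and by whom."""

    def __init__(self, min_copies=2):
        # How many independent archives count as "sufficient" (assumed).
        self.min_copies = min_copies
        self._archives = {}  # url -> set of mirror names holding a copy

    def add_url(self, url):
        """Record a download URL that archivers should pick up."""
        self._archives.setdefault(url, set())

    def unarchived_urls(self):
        """Return URLs that have fewer than min_copies archives."""
        return sorted(
            url for url, mirrors in self._archives.items()
            if len(mirrors) < self.min_copies
        )

    def register_archive(self, url, mirror):
        """Record that `mirror` now holds a copy of `url`."""
        if url not in self._archives:
            raise KeyError(url)
        self._archives[url].add(mirror)


registry = ArchiveRegistry(min_copies=1)
registry.add_url("http://example.invalid/pkg-1.0.tar.gz")
print(registry.unarchived_urls())
registry.register_archive("http://example.invalid/pkg-1.0.tar.gz", "mirror1")
print(registry.unarchived_urls())
```

The central server only keeps this bookkeeping; the actual files live
on whichever machines volunteer to archive them.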

>>If 
>>each package has a download URL (that's a real download URL, not just a
>>web page with other references) then we cache the archives and provide
>>a link to that archive if we detect that the source archive is gone.
> 
> 
> I guess the issue is how we know what the download_url points to.
> 
> I think we agree that the distutils meta-data is going to have to grow some 
> additional fields (or a single complex field) that point specifically to 
> source, win32 binary, redhat RPM, etc. download files. Of course, for 
> projects hosted on sourceforge, all this is moot since there is no such thing 
> as a URL pointing to a file (ok, there is, but I suspect your project would 
> be booted if you used URLs pointing directly at mirrors).

I think the SF downloading should be built into the downloading client, 
as a kind of screen scraper to get to the real file.  There's so much 
stuff on SF that it can be special cased.  Alternately, we could ask 
submitters to give us a direct URL, and we would only distribute that 
URL to mirrors, never to users.

In case of URLs, perhaps we need a (url, url_type, url_description) 
relation, where url_type is restricted, and maybe (or maybe not) (url, 
url_type) is unique.  url_type would be like homepage, documentation, 
changelog, tar.gz download, Windows installer, Mac disk image, etc. 
Kind of like SourceForge does for downloads, but both a bit larger, and 
a bit less explicit.  (E.g., I think SF has two different types for .gz 
and .bz2 files, which seems unnecessary.)
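
As a rough sketch, the relation could be expressed as a pair of SQLite
tables.  The table names, the package column, and the exact list of
types are assumptions for illustration, and the (url, url_type)
uniqueness is the "maybe" option from above, not a settled decision.

```python
import sqlite3

# Restricted vocabulary for url_type (illustrative, taken from the text).
URL_TYPES = [
    "homepage", "documentation", "changelog",
    "tar.gz download", "Windows installer", "Mac disk image",
]


def make_db():
    """Create an in-memory database holding the URL relation."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the restricted types
    conn.execute("CREATE TABLE url_types (name TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO url_types (name) VALUES (?)", [(t,) for t in URL_TYPES]
    )
    conn.execute("""
        CREATE TABLE package_urls (
            package         TEXT NOT NULL,
            url             TEXT NOT NULL,
            url_type        TEXT NOT NULL REFERENCES url_types(name),
            url_description TEXT,
            UNIQUE (url, url_type)
        )""")
    return conn


def add_url(conn, package, url, url_type, description=None):
    """Insert one row; raises sqlite3.IntegrityError on a bad type or duplicate."""
    conn.execute(
        "INSERT INTO package_urls VALUES (?, ?, ?, ?)",
        (package, url, url_type, description),
    )


conn = make_db()
add_url(conn, "example", "http://example.invalid/", "homepage")
print(conn.execute("SELECT url, url_type FROM package_urls").fetchall())
```

Keeping the types in their own table means the list can grow without a
schema change, while free-form detail goes in url_description.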

>>>What's the "Acme" category hold? :)
>>
>>Joke modules, I believe.  Pythonistas apparently aren't as prone to
>>humor.  So it goes.
> 
> 
> That's what I figured. I'll take the rest of your statement in the sarcastic 
> light that it was obviously intended ;)

There really aren't that many joke Python modules, at least that I've 
seen.  Maybe because we lack the namespace for them.  Or because we are 
less prone to puns.

>>I've found the trove categories to be overwhelming to use when creating
>>packages, and I've never paid attention to them when looking for
>>packages.  In part because I can't expect authors to have defined
>>categories for their package.
> 
> 
> But they do. I've personally found the category searching to be quite 
> productive a couple of times now.
> 
> Perhaps I should generate some statistics? I'd have separate counts for users 
> using any categories and those using topics...

Statistics might be useful -- we already have the number of packages 
that are using categories (in the browse screen), but a statistic of the 
number of packages that aren't using the categories would be helpful.

>>In Perl the categories are also caught up in naming, which I don't
>>think we'd use.  And you can't belong to multiple categories, for the
>>same reason.  But I think they present a simpler set of categories that
>>would be more useful.  The Vaults has a reasonable set of categories as
>>well.  We just need fewer categories.
> 
> 
> Again, the categories we have at the moment are just the combination of the 
> sourceforge and freshmeat listings. I'm well aware it's not the best list 
> that it could be and I'd be more than happy to work on the list. 

Maybe it would even be sufficient to fill it out just a little more (to 
make it more appropriate for Python and libraries), and then offer to 
display a trimmed-down list, since everyone fills out their categories 
by skimming through the list of available categories.  Keywords are 
another model, and are somewhat redundant now.  Myself, I never know 
what to put in for keywords.

An interactive setup.py-builder could be nice too -- it could help 
people get over the distutils hump, as well as promoting the necessary 
parts for PyPI submission.
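
A setup.py-builder could be as small as the sketch below, assuming it
only asks for a handful of string-valued distutils metadata fields.
The helper names (FIELD_ORDER, ask, render_setup) are hypothetical,
and a real tool would also need to handle modules, packages, and
classifiers.

```python
# String-valued distutils metadata fields to prompt for, in order.
FIELD_ORDER = [
    "name", "version", "description",
    "author", "author_email", "url", "download_url",
]


def ask(prompt_fn=input):
    """Prompt for each field; prompt_fn can be swapped out for testing."""
    return {key: prompt_fn("%s: " % key).strip() for key in FIELD_ORDER}


def render_setup(fields):
    """Return the text of a setup.py containing the non-empty fields."""
    lines = ["from distutils.core import setup", "", "setup("]
    for key in FIELD_ORDER:
        value = fields.get(key)
        if value:
            # %r quotes and escapes the string so the output is valid Python.
            lines.append("    %s=%r," % (key, value))
    lines.append(")")
    return "\n".join(lines) + "\n"


print(render_setup({"name": "screenscraper", "version": "0.1"}))
```

Calling render_setup(ask()) would run the interactive version; fields
left blank are simply omitted from the generated file.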

>>I'd probably want to set up an automatic submission client that uses
>>docstrings, but that's a separate issue.
> 
> 
> I think this is a great idea (I'm a fan of lowest-possible-burden for 
> contributors ;)
> 
> 
> 
>>The idea of broad categories (application, library, module) may
>>alleviate the UI issues.
> 
> 
> Agreed.
> 
> 
> 
>>We already have enough fragmentation -- even 
>>the Vaults get new submissions that don't go to PyPI -- so I'd hate to
>>set up an entirely separate system.
> 
> 
> Aside: It really is a shame I got zero response from repeated enquiries about 
> collaboration with the Vaults people. I honestly didn't want to have to 
> develop a new system :(

Yeah, I was wondering about that.  It would still be nice to import the 
Vaults data, as there's a lot of older but useful stuff there.  Even if 
they didn't collaborate, at some point it would be nice to make PyPI 
more canonical, and import the Vaults data and (hopefully) shut that 
service down once it's entirely redundant.

I should really be careful, as I am prone to distraction (this itself is 
a distraction) and I probably can't participate in this stuff in a 
really consistent way.  But something like setting up an import might be 
a good level of commitment.

>>Sure, after a bit of back-and-forth here.  Maybe it would be easier to
>>just write something up to be put in docs/ in CVS.
> 
> 
> Which CVS?

In the pypi SF project.  Is that the canonical repository for the code?

   Ian