How Python can have a CPAN.
I have looked through the long discussion about the topic. As far as I can see these are the sensible, reasonable, concrete suggestions of what can be done to improve PyPI towards CPAN quality:

1. Ask those who do not upload why they don't upload, and see if we can fix it.
2. Ask those who do not want to upload why they don't provide a download URL.
3. Making it easier to mirror/replicate all metadata.
4. It should be easy to list the files of previous "hidden" versions for a package.
5. Have a --dist-file command for Distribute.
6. Better documentation.

Is there anything I missed?

-- Lennart Regebro: Python, Zope, Plone, Grok http://regebro.wordpress.com/ +33 661 58 14 64
On Sat, Dec 26, 2009 at 18:40, Lennart Regebro
3. Making it easier to mirror/replicate all metadata.
I've now written a script that does this via XML-RPC. It's dead easy, actually. However, it's also slow: it looks like it takes at least five hours. But that's to download all the metadata, and there is quite a lot of it. :) Syncing after the initial download is reasonably fast.

I wonder if the initial slow download is a problem? It certainly makes it harder to run tests/queries on the complete dataset if you need to wait a day before you can do it. But if you want to set up a third-party service it shouldn't be a great hindrance. It takes way longer than that to develop the service. :)

-- Lennart Regebro: http://regebro.wordpress.com/ Python 3 Porting: http://python-incompatibility.googlecode.com/ +33 661 58 14 64
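The incremental sync described above can be sketched against PyPI's XML-RPC interface. This is a minimal sketch, assuming the index exposes the historic changelog(since) method returning (name, version, timestamp, action) tuples; the fetch_updates helper and its argument names are placeholders, not anything from the thread:

```python
def fetch_updates(pypi, last_sync):
    """Collect release metadata for everything touched since last_sync.

    `pypi` is an XML-RPC proxy to the index (or any object exposing the
    same methods); `last_sync` is a Unix timestamp of the previous run.
    """
    updated = {}
    # changelog() yields (name, version, timestamp, action) tuples for
    # every event that happened after the given timestamp.
    for name, version, timestamp, action in pypi.changelog(last_sync):
        if version is None:
            # Package-level events (e.g. project creation) carry no release.
            continue
        updated[(name, version)] = pypi.release_data(name, version)
    return updated
```

Against the live index this would be called with something like fetch_updates(xmlrpc.client.ServerProxy('http://pypi.python.org/pypi'), last_sync), then merged into the locally stored copy.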
On Dec 27, 2009, at 3:27 AM, Lennart Regebro wrote:
On Sat, Dec 26, 2009 at 18:40, Lennart Regebro
wrote: 3. Making it easier to mirror/replicate all metadata.
I've now written a script that does this via XML-RPC. It's dead easy, actually. However, it's also slow: it looks like it takes at least five hours. But that's to download all the metadata, and there is quite a lot of it. :) Syncing after the initial download is reasonably fast.
How hard would it be to set up a cron to tar up a daily snapshot so that the initial download was quick (no API calls), then you'd only need an update from the last snapshot? Thanks, S
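A minimal sketch of such a cron job, assuming the metadata dump lives in a local directory (the ./pypi-metadata and ./snapshots paths here are placeholders, not anything from the thread):

```shell
#!/bin/sh
# Placeholder paths; point these at the real metadata dump location.
DATA_DIR=./pypi-metadata
SNAP_DIR=./snapshots
mkdir -p "$DATA_DIR" "$SNAP_DIR"

# One dated, compressed tarball per run; cron would trigger this nightly.
STAMP=$(date +%Y%m%d)
tar czf "$SNAP_DIR/pypi-metadata-$STAMP.tar.gz" "$DATA_DIR"

# Keep only the seven most recent snapshots.
ls -1t "$SNAP_DIR"/pypi-metadata-*.tar.gz 2>/dev/null | tail -n +8 |
while read -r old; do rm -f "$old"; done
```

A crontab entry along the lines of `15 3 * * * /path/to/snapshot.sh` would then publish a fresh snapshot every night, and clients would only need the XML-RPC API for changes since the snapshot date.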
On Sun, Dec 27, 2009 at 11:30, ssteinerX@gmail.com
How hard would it be to set up a cron to tar up a daily snapshot so that the initial download was quick (no API calls), then you'd only need an update from the last snapshot?
Not hard, I think. I haven't done a complete download, but the script I made simply reads all the data into memory and then dumps the huge dictionary to a pickle. Updates would be done by loading the dictionary into memory again, updating it with whatever happened since the last time, and dumping it to a pickle again. :)

The biggest problem with that technique is the memory usage, but I didn't see it go up significantly during my one-hour test run, so I think it is in fact feasible. If not, I guess each package could be dumped into the pickle separately, but that would make updating more complicated.

-- Lennart Regebro: http://regebro.wordpress.com/ Python 3 Porting: http://python-incompatibility.googlecode.com/ +33 661 58 14 64
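That load/update/dump cycle might look like the following sketch; the file name and helper names are placeholders of mine, not from the thread:

```python
import os
import pickle

PICKLE_FILE = 'pypi-metadata.pkl'  # placeholder file name


def load_metadata():
    """Load the previously dumped dictionary, or start fresh."""
    if os.path.exists(PICKLE_FILE):
        with open(PICKLE_FILE, 'rb') as f:
            return pickle.load(f)
    return {}


def save_metadata(metadata):
    """Dump the whole dictionary back to disk in one go."""
    with open(PICKLE_FILE, 'wb') as f:
        pickle.dump(metadata, f)


def apply_updates(metadata, updates):
    """Merge whatever happened since the last run into the dictionary."""
    metadata.update(updates)
    return metadata
```

The whole dictionary lives in memory between load and save, which is the memory concern mentioned above; dumping each package separately would trade that memory for more complicated updates.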
On Sat, Dec 26, 2009 at 18:40, Lennart Regebro
1. Ask those who do not upload why they don't upload, and see if we can fix it. 2. Ask those who do not want to upload why they don't provide a download URL.
Out of a total of 8522 packages on PyPI, there are 203 packages (2.4%) whose latest release provides neither a package on PyPI nor a download URL. Of these, 16 do not provide any contact data.

There are, as far as I can figure out, around 150 individuals to contact. That's enough people that a questionnaire might be useful, with answers that we guess are going to be the common ones. Like:

Why did you not upload to PyPI, you bum?
1. I didn't know you could upload to PyPI.
2. I don't want anyone but my servers to have the downloads, thank you. [Why?]
3. My company's policy does not allow it. [Why?]
4. The distutils "sdist upload" procedure doesn't work for me. [Why?]
5. Other. [What?]

But there is a download URL metadata field. You could have at least filled that in, lazy person!
1. I didn't know you could.
2. Other. [What?]

Is there significant interest in doing this? In that case, what answer options should we have?

-- Lennart Regebro: http://regebro.wordpress.com/ Python 3 Porting: http://python-incompatibility.googlecode.com/ +33 661 58 14 64
There are, as far as I can figure out, around 150 individuals to contact. That's enough people that a questionnaire might be useful, with answers that we guess are going to be the common ones. Like:
Why did you not upload to PyPI, you bum?
1. I didn't know you could upload to PyPI.
2. I don't want anyone but my servers to have the downloads, thank you. [Why?]
3. My company's policy does not allow it. [Why?]
4. The distutils "sdist upload" procedure doesn't work for me. [Why?]
5. Other. [What?]
But there is a download URL metadata field. You could have at least filled that in, lazy person!
1. I didn't know you could.
2. Other. [What?]
One answer option should be "there is nothing I can release at this point". Regards, Martin
On Dec 27, 2009, at 5:47 AM, Lennart Regebro wrote:
On Sat, Dec 26, 2009 at 18:40, Lennart Regebro
wrote: 1. Ask those who do not upload why they don't upload, and see if we can fix it. 2. Ask those who do not want to upload why they don't provide a download URL.
Out of a total of 8522 packages on PyPI, there are 203 packages (2.4%) whose latest release provides neither a package on PyPI nor a download URL. Of these, 16 do not provide any contact data.
I'll update the PyPI checker in the distutilsversion test_pypi_versions.py to provide that information as well. I think I'll make a separate project (uploaded to PyPI, promise) out of it so we have a shared way to get statistics on PyPI projects. Is your code for this in the PyPI code repository, or were these just quick one-offs?

If there's no data on PyPI and no download URL, then wouldn't those be "non-packages?" And if there's no contact info, "non-package" by "nobody?" Sounds like a song title.

The survey should include a "Project abandoned" or "Nothing to upload" or "It was just a mistake" and an "Ok to delete immediately". Any non-project where the owner gives permission to delete, where the e-mail bounces, or there's no response in a month, just delete it.

That 2.4% is pretty small compared to somewhere like SourceForge, where every other project seems to be abandoned or worse, and there sure are a lot of them... S
On Sun, Dec 27, 2009 at 16:54, ssteinerX@gmail.com
I'll update the PyPI checker in the distutilsversion test_pypi_versions.py to provide that information as well. I think I'll make a separate project (uploaded to PyPI, promise) out of it so we have a shared way to get statistics on PyPI projects. Is your code for this in the PyPI code repository or were these just quick one-off's?
Quick one-offs. Here, in fact (takes an hour or two to run):

from xmlrpc import client

PYPIURL = 'http://pypi.python.org/pypi'
pypi = client.Server(PYPIURL)

for package in pypi.list_packages():
    for release in pypi.package_releases(package, False):
        release_urls = pypi.release_urls(package, release)
        if release_urls:
            continue
        release_data = pypi.release_data(package, release)
        if not release_data.get('download_url', ''):
            print("Package %s, Release: %s" % (package, release))
            print("  Did not have releases on PyPI or download URL")
            print("  Author: %s <%s>" % (release_data['author'],
                                         release_data['author_email']))
            print("  Maintainer: %s <%s>" % (release_data['maintainer'],
                                             release_data['maintainer_email']))
If there's no data on PyPI and no download url then wouldn't those be "non-packages?" And if there's no contact info, "non-package" by "nobody?" Sounds like a song title.
:)
The survey should include a "Project abandoned" or "Nothing to upload" or "It was just a mistake" and an "Ok to delete immediately".
Right.
Any non-project where the owner gives permission to delete, where the e-mail bounces, or there's no response in a month, just delete it.
There may be more people who have the rights to the system, so in fact in these cases we should check who has Owner rights and contact all of them. -- Lennart Regebro: http://regebro.wordpress.com/ Python 3 Porting: http://python-incompatibility.googlecode.com/ +33 661 58 14 64
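Looking up who holds Owner rights can also be sketched against PyPI's XML-RPC interface, which (at the time) exposed a package_roles(name) method returning (role, user) pairs; the helper below is a sketch under that assumption, with invented names:

```python
def package_owners(pypi, package):
    """Return the usernames holding the Owner role for a package.

    `pypi` is an XML-RPC proxy to the index; package_roles() is assumed
    to return (role, user_name) pairs such as ('Owner', 'someuser').
    """
    return [user for role, user in pypi.package_roles(package)
            if role == 'Owner']
```

Combined with user_packages() on the same interface, this would let a questionnaire reach every listed owner rather than just the author/maintainer fields in the release metadata.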
On Dec 27, 2009, at 11:27 AM, Lennart Regebro wrote:
On Sun, Dec 27, 2009 at 16:54, ssteinerX@gmail.com
wrote: I'll update the PyPI checker in the distutilsversion test_pypi_versions.py to provide that information as well. I think I'll make a separate project (uploaded to PyPI, promise) out of it so we have a shared way to get statistics on PyPI projects. Is your code for this in the PyPI code repository or were these just quick one-off's?
Quick one-offs. Here, in fact:
Takes an hour or two to run.
from xmlrpc import client

PYPIURL = 'http://pypi.python.org/pypi'
pypi = client.Server(PYPIURL)

for package in pypi.list_packages():
    for release in pypi.package_releases(package, False):
        release_urls = pypi.release_urls(package, release)
        if release_urls:
            continue
        release_data = pypi.release_data(package, release)
        if not release_data.get('download_url', ''):
            print("Package %s, Release: %s" % (package, release))
            print("  Did not have releases on PyPI or download URL")
            print("  Author: %s <%s>" % (release_data['author'],
                                         release_data['author_email']))
            print("  Maintainer: %s <%s>" % (release_data['maintainer'],
                                             release_data['maintainer_email']))
Ok, thanks, I'll throw that into the code in some form. The tarred download would be really handy for this utility: if there's no .pkl of the data, or the user requests it, I pull a fresh copy. Right now my query is very limited (I'm only looking for version info) and only takes a couple of minutes to build. Since I'm going to add more capabilities, having a quick way to refresh the whole thing would be great.

I'll put my version up in the new project, and maybe we can work together to get it into the PyPI code, or to store the version I build somewhere, though building it right on the server would seem to be much faster (if memory intensive). Thanks! S aka/Steve Steiner aka/ssteinerX
On Dec 27, 2009, at 1:34 PM, Martin v. Löwis wrote:
The tarred download would be really handy for this utility as, if there's no .pkl of the data, or the user requests it, I pull fresh copy.
How difficult would it be for you to provide such data, for anybody interested in using them?
Really easy; that's what I'm saying. Let's cooperate on getting this put somewhere. I have plenty of hosting resources and can put a cron job up to keep it up to date. S
On Sun, Dec 27, 2009 at 17:50, ssteinerX@gmail.com
The tarred download would be really handy for this utility as, if there's no .pkl of the data, or the user requests it, I pull fresh copy.
Right... Anyone want to host such a tarred download? :-) I pretty much got the code, but don't feel like setting up and hosting it right now. Maybe later. -- Lennart Regebro: Python, Zope, Plone, Grok http://regebro.wordpress.com/ +33 661 58 14 64
On Dec 27, 2009, at 2:10 PM, Lennart Regebro wrote:
On Sun, Dec 27, 2009 at 17:50, ssteinerX@gmail.com
wrote: The tarred download would be really handy for this utility as, if there's no .pkl of the data, or the user requests it, I pull fresh copy.
Right... Anyone want to host such a tarred download? :-) I pretty much got the code, but don't feel like setting up and hosting it right now. Maybe later.
I'll finish up the code, get it set up on a cron job to keep it up to date, and host it. I'll probably just put the finished data up on S3, no big deal and I can run the cron job on one of our servers. S
On Sun, Dec 27, 2009 at 17:50, ssteinerX@gmail.com
wrote: The tarred download would be really handy for this utility as, if there's no .pkl of the data, or the user requests it, I pull fresh copy.
Right... Anyone want to host such a tarred download? :-) I pretty much got the code, but don't feel like setting up and hosting it right now. Maybe later.
I have some spare space on a custom domain. What do you need? David
On Dec 27, 2009, at 6:42 PM, david.lyon@preisshare.net wrote:
On Sun, Dec 27, 2009 at 17:50, ssteinerX@gmail.com
wrote: The tarred download would be really handy for this utility as, if there's no .pkl of the data, or the user requests it, I pull fresh copy.
Right... Anyone want to host such a tarred download? :-) I pretty much got the code, but don't feel like setting up and hosting it right now. Maybe later.
I have some spare space on a custom domain. What do you need?
I have it covered. S
On Sun, Dec 27, 2009 at 11:47, Lennart Regebro
Out of a total of 8522 packages on PyPI, there are 203 packages (2.4%) whose latest release provides neither a package on PyPI nor a download URL. Of these, 16 do not provide any contact data.
Hi Lennart,

Glad to see someone is interested in a PyPI mirror; I have one here, and it's a pity. Statistics (from the creation of the mirror/proxy; the goal is to avoid external downloads, like an internal Debian mirror):

2009-12-15 21:37:20,855 DEBUG Found (cached): 0
2009-12-15 21:37:20,855 DEBUG Stored (downloaded): 15367
2009-12-15 21:37:20,855 DEBUG Not found (404): 188
2009-12-15 21:37:20,855 DEBUG Invalid packages: 0
2009-12-15 21:37:20,855 DEBUG Invalid URLs: 54
2009-12-15 21:37:20,855 DEBUG Runtime: 208m38s

The root issue (for me) is: packages outside of PyPI. A lot of broken links, broken html pages or stupid scripts (cf. old SourceForge). Some examples:

WARNING Unload downloading http://wiki.woodpecker.org.cn/moin/UliPad (timed out)
WARNING Unload downloading http://launchpad.net/mcrepogen/+download (The read operation timed out)
WARNING Unload downloading http://launchpad.net/mcrepogen (The read operation timed out)
WARNING Unload downloading https://launchpad.net/lovely.tal (The read operation timed out)
WARNING Unload downloading ffnet.sourceforge.net (unknown url type: ffnet.sourceforge.net)
WARNING Unload downloading http://pysqlite.org/ ((-3, 'Temporary failure in name resolution'))
Is there significant interest in doing this?
YES! ;)

In that case, what answer options should we have?
Always upload a version to PyPI; that's the only way to have a reliable, solid and smart PyPI, and an easy way to proxy it. Think of the case where SF is down: no docutils. Zope server down: no Zope 2, no Zope 3, no ZTK, no buildout... With a full mirror I don't care...
Note: I'm very happy when I see a distribution with:
- a description
- a summary (with examples if necessary)
- a changelog (a quick way to see what's new)
- the name and email of the author (or maintainer)
- files included (with distribution name = package name, not MyPackage vs. mypackage)
Like this:
http://pypi.python.org/pypi/collective.portlet.relateditems/0.3.0
And not this:
http://pypi.python.org/pypi/django-sphinxdoc/0.2
Cheers
--
Sebastien Douche
On Dec 30, 2009, at 1:48 PM, Sebastien Douche wrote:
On Sun, Dec 27, 2009 at 11:47, Lennart Regebro
wrote: Out of a total of 8522 packages on PyPI, there are 203 packages (2.4%) whose latest release provides neither a package on PyPI nor a download URL. Of these, 16 do not provide any contact data.
Hi Lennart, Glad to see someone is interested in a PyPI mirror; I have one here, and it's a pity.
Statistics (from the creation of the mirror/proxy; the goal is to avoid external downloads, like an internal Debian mirror):
2009-12-15 21:37:20,855 DEBUG Found (cached): 0
2009-12-15 21:37:20,855 DEBUG Stored (downloaded): 15367
2009-12-15 21:37:20,855 DEBUG Not found (404): 188
2009-12-15 21:37:20,855 DEBUG Invalid packages: 0
2009-12-15 21:37:20,855 DEBUG Invalid URLs: 54
2009-12-15 21:37:20,855 DEBUG Runtime: 208m38s
The root issue (for me) is: packages out of the PyPI. A lot of broken links, broken html pages or stupid scripts (cf. old SourceForge).
I will put a way of getting this data out, thanks for the heads up.
Some examples:
WARNING Unload downloading http://wiki.woodpecker.org.cn/moin/UliPad (timed out)
WARNING Unload downloading http://launchpad.net/mcrepogen/+download (The read operation timed out)
WARNING Unload downloading http://launchpad.net/mcrepogen (The read operation timed out)
WARNING Unload downloading https://launchpad.net/lovely.tal (The read operation timed out)
WARNING Unload downloading ffnet.sourceforge.net (unknown url type: ffnet.sourceforge.net)
WARNING Unload downloading http://pysqlite.org/ ((-3, 'Temporary failure in name resolution'))
Is there significant interest in doing this?
YES! ;)
In that case, what answer options should we have?
Always upload a version to PyPI; that's the only way to have a reliable, solid and smart PyPI, and an easy way to proxy it. Think of the case where SF is down: no docutils. Zope server down: no Zope 2, no Zope 3, no ZTK, no buildout... With a full mirror I don't care...
Note: I'm very happy when I see a distribution with:
- a description
- a summary (with examples if necessary)
- a changelog (a quick way to see what's new)
- the name and email of the author (or maintainer)
- files included (with distribution name = package name, not MyPackage vs. mypackage)
Like this : http://pypi.python.org/pypi/collective.portlet.relateditems/0.3.0
And not this: http://pypi.python.org/pypi/django-sphinxdoc/0.2
I have put this into my working spec document which I'll be publishing with the first version of the code (which won't have all the options implemented, but they'll be in the plan/issue tracker in case anyone wants to help). Steve
On 12/30/2009 10:57 AM, ssteinerX@gmail.com wrote:
On Dec 30, 2009, at 1:48 PM, Sebastien Douche wrote:
On Sun, Dec 27, 2009 at 11:47, Lennart Regebro
wrote: Out of a total of 8522 packages on PyPI, there are 203 packages (2.4%) whose latest release provides neither a package on PyPI nor a download URL. Of these, 16 do not provide any contact data.
Hi Lennart, Glad to see someone is interested in a PyPI mirror; I have one here, and it's a pity.
Statistics (from the creation of the mirror/proxy; the goal is to avoid external downloads, like an internal Debian mirror):
2009-12-15 21:37:20,855 DEBUG Found (cached): 0
2009-12-15 21:37:20,855 DEBUG Stored (downloaded): 15367
2009-12-15 21:37:20,855 DEBUG Not found (404): 188
2009-12-15 21:37:20,855 DEBUG Invalid packages: 0
2009-12-15 21:37:20,855 DEBUG Invalid URLs: 54
2009-12-15 21:37:20,855 DEBUG Runtime: 208m38s
The root issue (for me) is: packages outside of PyPI. A lot of broken links, broken html pages or stupid scripts (cf. old SourceForge).

I will put a way of getting this data out, thanks for the heads up.
Greetings Sebastien and Steve,

The way of getting [external packages] was already implemented. It is called `setuptools.package_index`, which is what we use in our internal mirror program (which we plan to open-source and perhaps also host publicly); it also does the metadata extraction (PKG-INFO, requires.txt) and the index files that I mentioned earlier.

It is of no use to pity z3c.pypimirror or any other mirror program, because the issue is not with those programs, but with the lack of a central archive from which all sources and metadata can be reliably mirrored. I will, once again, draw the reader's attention to the following: [Steffen Mueller]
My thesis is that the huge success of the CPAN has been facilitated by two factors[2]. The first is simplicity. When Jarkko Hietaniemi originally came up with it, the CPAN was (and mostly still is) just an FTP archive with a by-author directory structure that is mirrored many times. http://www.mail-archive.com/distutils-sig@python.org/msg10537.html
-srid
On Wed, Dec 30, 2009 at 19:48, Sebastien Douche
Is there significant interest in doing this?
YES! ;)
In that case, what answer options should we have?
Always upload a version to PyPI, the only way to have a reliable,
The question was if there was interest in sending out a questionnaire to maintainers. Forcing uploads to PyPI is a debate that has been flogged to death. -- Lennart Regebro: Python, Zope, Plone, Grok http://regebro.wordpress.com/ +33 661 58 14 64
On Wed, Dec 30, 2009 at 19:48, Sebastien Douche
wrote: Is there significant interest in doing this?
YES! ;)
In that case, what answer options should we have?
Always upload a version to PyPI, the only way to have a reliable,
The question was if there was interest in sending out a questionnaire to maintainers. Forcing uploads to PyPI is a debate that has been flogged to death.
In this day and age it just may not be viable to do that. If PEP 345 could be adjusted to have a Code-Repository field, then it wouldn't be so difficult to use a bot on PyPI to pull code *in*, test it and package it. Developers don't always have time to drop back to a command line and build and upload using a command line tool that takes 30 seconds, especially after they have already done an 'hg push' or 'svn commit' to their own repository.

I'd hazard a guess and say that 80% of PyPI projects would be better served with an (external) code repository reference than actually keeping everything built on PyPI - and asking the package creators to do that. Here, I don't want to throw away PyPI. Clearly it needs to stay and retain its traditional operating mode. I'm just making the point that a simpler metadata-based solution might serve the needs of users more. David
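PEP 345 defines no such field (the closest it offers is the repeatable Project-URL field), but as a sketch, the suggestion might look like this hypothetical PKG-INFO fragment, with Code-Repository being the invented addition and the name/URL being placeholders:

```
Metadata-Version: 1.2
Name: example-package
Version: 0.1
Project-URL: Homepage, http://example.org/example-package
Code-Repository: http://example.org/hg/example-package
```

A bot on the index side could then poll that repository URL, run the tests, and build the sdist itself, rather than asking the developer to upload after every push.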
On Wed, Dec 30, 2009 at 22:39, Lennart Regebro
The question was if there was interest in sending out a questionnaire to maintainers.
Sorry Lennart. I think it's a good step. Go ahead ;).
--
Sebastien Douche
On Wed, 30 Dec 2009 19:48:34 +0100, Sebastien Douche
Glad to see someone is interested by a PyPI mirror, I have one here and it's a pity.
How did you make it?
Note: I'm very happy when I see a distribution with:
- a description
- a summary (with examples if necessary)
- a changelog (a quick way to see what's new)
- the name and email of the author (or maintainer)
- files included (with distribution name = package name, not MyPackage vs. mypackage)
Sure.
Like this : http://pypi.python.org/pypi/collective.portlet.relateditems/0.3.0
I just had a quick look at that package, just as an example. The next problem is that it is a Python 2.4 egg. That is a real problem... what about *my* version of Python x.y? The process of user confusion now starts on what to do and how to get that installed. We are still not at anything nearing the simplicity of CPAN. But it is possible to do something about it - I hope. David
participants (7)
- "Martin v. Löwis"
- David Lyon
- david.lyon@preisshare.net
- Lennart Regebro
- Sebastien Douche
- Sridhar Ratnakumar
- ssteinerX@gmail.com