[Catalog-sig] [PSF-Board] Troubled by changes to PyPI usage agreement

M.-A. Lemburg mal at egenix.com
Thu Jan 21 20:08:18 CET 2010


Tarek Ziadé wrote:
> On Thu, Jan 21, 2010 at 5:29 PM, M.-A. Lemburg <mal at egenix.com> wrote:
> [..]
>>
>> Sure, we could do all those things, but such a process will
>> cause a lot of admin overhead on part of the PSF.
> 
> Which process? The non-web mirroring requires no effort/work from the PSF.
> 
> The only effort that is required is technical, and it's 70% done at
> this point, I'd say.

No, it's not only technical. The administration overhead
comes into play on the legal side of things: it requires not
only initial checks of whether the mirrors adhere to a set of
technical standards, but also ongoing checks of whether they
adhere to the legal ones.

We'd avoid all that with e.g. a cloud setup run by the
PSF and get all the monitoring, statistics, etc. for free.

>> Using a content delivery system we'd avoid such administration
>> work: the PSF wouldn't have to sign agreements with 10-20 mirror
>> providers, wouldn't have to set up a monitoring system, keep
>> checking the mirror web content, etc.
> 
> What is a content delivery system here? Do you mean by that that the
> PSF would run the mirrors by itself? If so, how is this going to work
> technically? How would it be different?

See e.g. http://aws.amazon.com/cloudfront/ for such a system.

Using Amazon would even allow the PSF to run the web front end
based on the same system and data.

The main advantage is that they take care of all the sys admin
stuff and provide a unified platform to work with.
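
To make this concrete, here's a minimal sketch of what pushing a
static PyPI file dump into such a setup could look like, using the
boto library against an S3 bucket that a CloudFront distribution
serves from (the bucket name and dump directory are made up for
illustration):

    import os
    import boto

    # Credentials are picked up from the AWS_ACCESS_KEY_ID /
    # AWS_SECRET_ACCESS_KEY environment variables.
    conn = boto.connect_s3()

    # Hypothetical bucket that a CloudFront distribution points at.
    bucket = conn.get_bucket('pypi-mirror-example')

    def upload_tree(local_root):
        # Copy every file of the static PyPI dump into S3; CloudFront
        # then takes care of replicating it to its edge locations.
        for dirpath, dirnames, filenames in os.walk(local_root):
            for filename in filenames:
                path = os.path.join(dirpath, filename)
                key = bucket.new_key(os.path.relpath(path, local_root))
                key.set_contents_from_filename(path)

    upload_tree('/var/pypi-dump')

Everything below that upload call - replication, fail-over, edge
caching - is Amazon's problem, which is exactly the point.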

> Let me state it differently: what if each mirror maintainer is a PSF
> member? Does that address the legal/admin issues?

Perhaps the legal ones (can't say, we'd have to ask Van),
but certainly not the admin issues:

Each mirror maintainer would have to invest time in setting up
such a server, so instead of simplifying things, we'd make them
more work-intensive... and then you have to manage software updates
on those servers, fight network problems, handle differences
between server platforms and OSes, implement fail-over,
etc. etc. - basically all the usual operations stuff needed
to maintain a distributed cluster.

With a service like Amazon or Akamai you just have one platform
and/or API to worry about. All the sys admin work is handled by
others, and edge distribution comes right with the service.

>> Moreover, there would also be mirrors in parts of the world
>> that are currently not well covered by Pythonistas and thus
>> less likely to get a local mirror server setup.
> 
> This is just a matter of having a server IP in that part of the world.

You'd also have to have the data in that part of the world
if you want to benefit from a local mirror. That's what edge
distribution is all about.

> And in reality, as long as the main areas - US, Europe, Australia,
> etc. - are served, this fits our needs. Some people will probably
> have to go through several nodes to reach a mirror, but we can't
> have a server per major city.
> 
> So in any case, we are improving the situation, not making it worse.

That's not what I'm saying. I hope I'm making myself clear.
If not, please let me know.

What I'm trying to say is that it is more effective to look at
existing solutions to these standard problems and work from
there, instead of reinventing yet another distribution network
from the ground up.

This will save you a lot of work, and it will also simplify
the legal paperwork and administration on the PSF side of things.

>> How to arrange all this is really a PSF question more than
>> anything else.
>>
>> Also note that using a static file layout would make the
>> whole synchronization mechanism a lot easier - not only
>> for content delivery networks, but also for dedicated
>> volunteer run mirrors. There are lots of mirror scripts
>> out there that work with rsync or FTP, so no need to reinvent
>> the wheel.
>>
> 
> Those scripts already exist and are in use in the tools that are
> mirroring PyPI. They use HTTP calls rather than rsync, but that's
> about it.

Ok, so that wheel has already been reinvented :-)
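
For anyone not familiar with those tools: stripped of all error
handling, an HTTP-based mirror run boils down to something like the
sketch below. This is an illustration only, not the actual code of
any existing mirroring tool; it just walks the /simple/ index with
plain HTTP calls:

    import re
    import urllib2

    INDEX = 'http://pypi.python.org/simple/'

    # The /simple/ index is a plain HTML page with one link per
    # package; pull the package names out of the hrefs.
    html = urllib2.urlopen(INDEX).read()
    packages = re.findall(r'<a href=[\'"]?([^\'" >]+)', html)

    for href in packages[:10]:      # just a few packages, as a demo
        name = href.strip('/')
        # Each package page lists the downloadable release files.
        page = urllib2.urlopen(INDEX + name + '/').read()
        print name, 'lists', len(re.findall(r'<a ', page)), 'links'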

>> AFAICT, all the data on PyPI is static and can be rendered
>> as files in a directory dump. A simple cronjob could take
>> care of this every few minutes or so and extract the data
>> to a local directory which is then made accessible to
>> mirrors.
> 
> People are already doing rsync-like mirrors. But that's quite an
> incomplete mirror.

Why is that? Why can't the PyPI data be extracted to the
file system to make it more accessible to standard tools?
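
To illustrate what I mean: the metadata side of such a dump needs
little more than PyPI's XML-RPC interface. A rough sketch (the real
cronjob would also have to copy the release files themselves and
handle deletions; the target directory is made up):

    import os
    import xmlrpclib

    pypi = xmlrpclib.ServerProxy('http://pypi.python.org/pypi')
    DUMP_DIR = '/var/pypi-dump'

    # Render each release's metadata as a static file that standard
    # rsync/FTP mirror scripts can pick up as-is.
    for name in pypi.list_packages():
        for version in pypi.package_releases(name):
            pkg_dir = os.path.join(DUMP_DIR, name, version)
            if not os.path.isdir(pkg_dir):
                os.makedirs(pkg_dir)
            data = pypi.release_data(name, version)
            outfile = open(os.path.join(pkg_dir, 'metadata.txt'), 'w')
            for key, value in sorted(data.items()):
                outfile.write('%s: %r\n' % (key, value))
            outfile.close()

Run something like that from a cronjob every few minutes and any
standard mirroring tool can work from the resulting directory tree.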

> The whole point of the work I've been doing with Martin (partially
> reflected in PEP 381) is to be able to have the download statistics
> for each archive, no matter which mirror was used to download the
> file. That's quite valuable information.

Indeed, and Amazon will provide those to you without having
to do any extra work:

http://aws.amazon.com/about-aws/whats-new/2009/05/07/amazon-cloudfront-adds-access-logging-capability/
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2440
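
Turning those logs into per-archive download counts is a small
script. A rough sketch, assuming the tab-separated W3C extended log
format described in those announcements, with a '#Fields:' header
line naming the columns (the log file name is made up):

    import gzip
    from collections import defaultdict

    counts = defaultdict(int)
    fields = None

    # CloudFront delivers gzipped W3C extended log files; the
    # '#Fields:' header line names the tab-separated columns.
    for line in gzip.open('cloudfront-access.log.gz'):
        if line.startswith('#Fields:'):
            fields = line.split()[1:]
        elif not line.startswith('#') and fields:
            row = dict(zip(fields, line.rstrip('\n').split('\t')))
            counts[row['cs-uri-stem']] += 1

    for path, n in sorted(counts.items()):
        print n, path

Aggregate that over the log files from all edge locations and you
get the per-archive statistics PEP 381 is after, without running any
mirror infrastructure yourself.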

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jan 21 2010)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

::: Try our new mxODBC.Connect Python Database Interface for free ! ::::


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
               http://www.egenix.com/company/contact/

