[Distutils] changelog / CDN inconsistency (was: Re: Good news everyone, PyPI is behind a CDN)

Noah Kantrowitz noah at coderanger.net
Mon May 27 22:34:21 CEST 2013


On May 27, 2013, at 1:20 PM, holger krekel wrote:

> On Mon, May 27, 2013 at 12:58 -0700, Noah Kantrowitz wrote:
>> On May 27, 2013, at 12:18 PM, holger krekel wrote:
>> 
>>> On Mon, May 27, 2013 at 14:59 -0400, Donald Stufft wrote:
>>>> On May 27, 2013, at 2:54 PM, holger krekel <holger at merlinux.eu> wrote:
>>>> 
>>>>> On Mon, May 27, 2013 at 13:50 -0400, Donald Stufft wrote:
>>>>>> On May 27, 2013, at 12:39 PM, Donald Stufft <donald at stufft.io> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> On May 27, 2013, at 8:08 AM, holger krekel <holger at merlinux.eu> wrote:
>>>>>>> 
>>>>>>>> Hi Noah, Donald, (CC also Richard, Christian),
>>>>>>>> 
>>>>>>>> i just checked with a test package and think we might have a cache
>>>>>>>> consistency / changelog API problem.  It took me a while but here is 
>>>>>>>> the basic thing: I uploaded a test package, changelog API reports it has
>>>>>>>> changed, then i go to its simple page, and some of the time the new release
>>>>>>>> file shows up, sometimes not.
>>>>>>>> 
>>>>>>>> Tools like bandersnatch, pep381 and devpi-server (and probably others)
>>>>>>>> use PyPI's changelog API to determine if there are changes.  It seems
>>>>>>>> those changes are signalled faster than they become consistently accessible 
>>>>>>>> through the CDN.  This can lead to inconsistent mirrors because when 
>>>>>>>> the CDN has the files there is no change event anymore.  Such mirrors 
>>>>>>>> are run by companies in-house so i think it's a real problem.
>>>>>>>> 
>>>>>>>> Even without mirroring there can be problems because installs are not
>>>>>>>> directly repeatable: "pip install XYZ>=2.0" can give you first 2.0.1,
>>>>>>>> then 2.0.0 a minute later.  I had hoped that a particular ip address
>>>>>>>> sees things consistently.
>>>>>>>> 
>>>>>>>> I am not familiar with Fastly's caching properties -- can they notify
>>>>>>>> about the fact that a page/file is consistently up-to-date everywhere?  
>>>>>>>> Or can the cache be globally invalidated for a particular page/file?
>>>>>>>> Any other ideas?
>>>>>>>> 
>>>>>>>> Failing a customized Fastly setup, and perhaps as a short-term measure,
>>>>>>>> is there (or could there be) a special location provided by pypi.python.org which
>>>>>>>> the above tools could use to get at the actual non-cached data?  We
>>>>>>>> could then maybe mitigate the problem through updates of the respective tools.
>>>>>>>> That would at least solve the problem for one of my customers i think.
>>>>>>>> 
>>>>>>>> best,
>>>>>>>> holger
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sun, May 26, 2013 at 10:34 -0700, Noah Kantrowitz wrote:
>>>>>>>>> </farnsworth>
>>>>>>>>> 
>>>>>>>>> but seriously, at long last today it was my honor to throw the DNS switch to move PyPI to the Fastly caching CDN. I would like to thank Donald Stufft for doing much of the heavy lifting on the PyPI side, and to Fastly for graciously offering to host us. What does this mean for everyone? Well the biggest change is PyPI should get a whole lot faster. There are two major downsides however. There will now be a delay of several minutes in some cases between updating a package and having it be installable, and download counts will now be even more incorrect than they were before. The PyPI admins are discussing what to do about download counts long-term, but for now we all feel that the performance and availability benefits outweigh the loss. If anyone has any questions, or hears anything about issues with PyPI please don't hesitate to contact me.
>>>>>>>>> 
>>>>>>>>> --Noah
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> _______________________________________________
>>>>>>>>> Distutils-SIG maillist  -  Distutils-SIG at python.org
>>>>>>>>> http://mail.python.org/mailman/listinfo/distutils-sig
>>>>>>>> 
>>>>>>> 
>>>>>>> I mentioned it on twitter but might as well mention it here as well.
>>>>>>> 
>>>>>>> Currently there is no invalidation going on. The effect on the mirroring was unanticipated and I'm currently getting the invalidation API set up within PyPI.
>>>>>>> 
>>>>>>> -----------------
>>>>>>> Donald Stufft
>>>>>>> PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> /simple/ pages should now be immediately invalidated when a new package is released.
>>>>> 
>>>>> thanks Donald.  Looking at the implementation, i wonder what happens if 
>>>>> after ``self._conn.commit()`` a changelog API call arrives, returns changes
>>>>> and a client uses it to retrieve changes before the fastly-purging takes 
>>>>> place.  It's still a potential race-condition or am i missing something?
>>>>> 
>>>>> best,
>>>>> holger
>>>>> 
>>>>>> -----------------
>>>>>> Donald Stufft
>>>>>> PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> There's no way around a race condition.
>>>> 
>>>> ``self._conn.commit()`` is what makes the changes available. If we purged before committing, a client hitting the page between the purge and the ``self._conn.commit()`` would cause the pre-update page to be cached again, while the change log would appear to be updated. That is essentially the same problem we have now.
>>>> 
>>>> The current implementation does mean that if a client happens to hit between the commit and the purge they'll see old data; however, that's pretty unlikely.
>>> 
>>> Purging can take a second and also depends on the network connectivity 
>>> between pypi.python.org and fastly's api to begin with.   I am afraid 
>>> the race-condition is bound to happen and then hard to detect.  
>>> 
>>> Not sure how exactly pypi.python.org is deployed but could commit() use
>>> a semaphore that the changelog APIs also use, so that the latter only
>>> return after purging (and then some) has happened?  I don't think
>>> mirrors would mind sometimes waiting a few seconds before the changelog* call
>>> returns as long as the state is then consistent.
>>> 
>>> Lastly, i think introducing a bit of internal syncing overhead to commit()/
>>> changelog should be ok because we have only a few writes and hardly any read load.
>> 
>> Mirroring should not be affected by caching at all, as new packages mean new URLs (/pypi/name/version), so when you retrieve them there will be no cache issues. 
> 
> The simple/PROJ pages are changed, not newly created.  (and yes,
> new release files are not so much the problem because they are new
> and thus retrieved from fastly on first access).

Yes, pep381client is fundamentally incompatible with the future of PyPI's infrastructure. Sorry, this will not be changed at this point. If people would like to continue to operate mirrors, they will need to transition to using the API to access package information, fetch updated files, and rebuild any relevant index data. For example, this is how Donald's crate.io mirror operates. Using the current strategy of scraping the simple/ pages will continue to work; you just need to retry failed requests until they succeed (and check that the per-project pages match the version you expect from the change log; consider it a failure if they do not). This is just a stopgap though, and should not be considered a long-term solution.
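
A minimal sketch of that stopgap, for illustration only: poll the per-project simple/ page until the file announced by the change log is listed, and treat the sync as failed otherwise. The names `SIMPLE_URL`, `fetch_simple_page`, `listed_files` and `wait_for_file` are hypothetical, the index URL layout is an assumption, and the caller is assumed to already know the expected file name from the change log entry. This is not how any existing mirror tool is implemented.

```python
import re
import time
import urllib.request

# Assumed location of the per-project simple page; adjust for your index.
SIMPLE_URL = "https://pypi.python.org/simple/{project}/"


def fetch_simple_page(project):
    """Fetch the per-project simple page (possibly from a stale cache)."""
    with urllib.request.urlopen(SIMPLE_URL.format(project=project)) as resp:
        return resp.read().decode("utf-8", "replace")


def listed_files(html):
    """Return the set of file names linked from a simple page."""
    # Links typically look like ../../packages/.../name-1.0.tar.gz#md5=...
    hrefs = re.findall(r'href="([^"]+)"', html)
    return {h.split("#", 1)[0].rsplit("/", 1)[-1] for h in hrefs}


def wait_for_file(project, expected_file, fetch=fetch_simple_page,
                  attempts=5, delay=2.0, sleep=time.sleep):
    """Retry until expected_file is listed; return False if it never is."""
    for attempt in range(attempts):
        if expected_file in listed_files(fetch(project)):
            return True
        if attempt < attempts - 1:
            sleep(delay)  # give the cache time to be purged or to expire
    return False
```

A mirror would call something like `wait_for_file("XYZ", "XYZ-2.0.1.tar.gz")` after seeing the change in the change log, and re-queue the project if it returns False. The `fetch` and `sleep` parameters exist only so the loop can be exercised without network access.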

> 
>> What I think you mean is this makes a race condition for pep381client,
>> however this is a bug in pep381client, not PyPI. If you would like to
>> submit a patch for a Paxos-based replication protocol, I'm sure Donald
>> and I would be happy to review it.
> 
> I am a bit lost as to what you are talking about here.
> 
> The move to CDN broke things that worked before.  The changelog API
> reported changes that could not be seen afterwards.  This remains true
> after Donald's changes which just make it less likely but not impossible
> to happen.

They worked effectively by accident, not because it was correct. Had I understood how backwards the pep381 systems are, I would have alerted you all sooner; I apologize for this lapse. I am happy to talk about how to correctly use the PyPI API with anyone who has questions, or to discuss more advanced replication options that will be free of race conditions in a distributed (yes, PyPI is a distributed database now) environment.

--Noah
