[Catalog-sig] start on static generation, and caching - apache config.

Phillip J. Eby pje at telecommunity.com
Mon Jul 9 18:44:45 CEST 2007

At 07:13 PM 7/9/2007 +0400, René Dudfield wrote:
>The way to do this atomically, so no one can possibly get an old
>page, is to remove the static file as the change is committed.
>Then everyone gets the latest change right away - as soon as the
>change has been committed.

This sounds pretty good...  except that you may need better 
protection against a race condition.  What happens if a page is 
removed *while* it is being regenerated?  PostgreSQL has MVCC for 
read-only transactions, so the static page will be generated 
against old data, unless some other locking mechanism, shared by 
both the deletion and generation mechanisms, serializes access to 
the static file.

One possible approach: if the generator writes its files to 
foo/index.html.tmp (opened with exclusive access) and then renames 
them to 'foo/index.html', then the deletion mechanism can attempt to 
*first* remove the .tmp file, then the real file.  Both processes 
must be robust against their renames or unlinks or exclusive open()'s 
failing, but there would then be no possibility of collision.  The 
exclusive open would have to be done at the *start* of write 
processing, however, before any database queries have been 
attempted.  (And the database connection must be rolled back at 
that point.)  This ensures that, if a writer succeeds in locking 
the .tmp file, it is seeing data that is current.
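In today's Python, the exclusive-open-then-rename dance might look something like the sketch below (the function names are illustrative, not anything PyPI actually has):

```python
import os

def open_exclusive(cache_path):
    # O_CREAT | O_EXCL makes os.open() fail with OSError if the
    # .tmp file already exists, so at most one writer can hold it
    # at a time.
    fd = os.open(cache_path + '.tmp',
                 os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    return os.fdopen(fd, 'w')

def publish(cache_path):
    # rename() is atomic on POSIX: readers see either the old page
    # or the complete new one, never a partial write.
    os.rename(cache_path + '.tmp', cache_path)
```

A second call to open_exclusive() while the .tmp file exists raises OSError, which is exactly the "robust against exclusive open()s failing" behavior described above.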

All that having been said, the idea in general sounds good.  If PyPI 
itself simply checked whether the URL it's about to serve is 
cacheable (i.e., has a static location and no user logged in), and if 
so, opened the temp file for exclusive writing, it could just dump 
its generated page out, and rename it at the end if it had been 
successful in acquiring the temp file.

And voila!  No separate caching process, no scheduling, and an always 
perfectly-up-to-date cache.  As soon as a page becomes out of date, 
it gets served dynamically...  but only for as long as it takes to 
serve one copy of that page.  :)

In pseudocode:

     def process_request():
         if no authentication header and URL path is cacheable:
             try:
                 temp = exclusive open cache file with .tmp extension
             except os.error:
                 pass    # another writer holds the lock; serve dynamically
             else:
                 with stdout redirected to temp:
                     result = process_request_normally()
                 try:
                     rename(tempfilename, realfilename)
                 except os.error:
                     pass    # page was invalidated meanwhile; discard
                 return result
         return process_request_normally()

Here, 'process_request_normally()' should refer to everything that 
PyPI does now, *including database connection rollback or 
commit*.  This will ensure that it's impossible to write stale data 
to the cache.

The deletion process should just do this:

     for name in (cache_path + '.tmp', cache_path):
         try:
             unlink(name)
         except os.error:
             pass

after committing the database transaction.
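As a runnable sketch, the invalidator's two unlinks (in this order: .tmp first, then the published page) would be:

```python
import os

def invalidate(cache_path):
    # Order matters: removing the .tmp file first kills any
    # uncommitted writer; removing the real file second undoes a
    # commit that slipped in between the two unlinks.
    for name in (cache_path + '.tmp', cache_path):
        try:
            os.unlink(name)
        except OSError:
            pass  # already gone; nothing to do
```

Swallowing OSError makes the call idempotent, so it is safe to run even when neither file exists.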

Informal serialization proof:

* Only one process may write to a page's .tmp file at a time

* Either the writer has committed its page write (by renaming the 
.tmp file), or it has not (i.e., rename() is atomic)

* If the writer has *not* committed its page, then the first unlink 
will prevent it from doing so.

* If the writer *has* committed its page, then the second unlink will 
undo this.

* If, between the two unlink operations, another writer appears, 
that writer will be reading current data from the database, because 
it has to acquire exclusive access to the .tmp file before doing a 
rollback and reading the data it will use for writing.

QED, it will be impossible to have stale data in the cache, unless 
the invalidating request fails to attempt its two unlink operations 
during the brief window after its database commit.
