[Catalog-sig] start on static generation, and caching - apache config.
Phillip J. Eby
pje at telecommunity.com
Mon Jul 9 18:44:45 CEST 2007
At 07:13 PM 7/9/2007 +0400, René Dudfield wrote:
>The way to do this atomically, so no one can possibly get an old
>page: the static file will be removed as the change is committed.
>Then everyone gets the latest change right away - as soon as the
>change has been committed.
This sounds pretty good... except that you may need better
protection against a race condition. What happens if a page is
removed *while* it is being regenerated? PostgreSQL uses MVCC for
read-only transactions, so the static page will be generated against
old data unless you have some other locking mechanism, shared by
both the deletion and generation mechanisms, to serialize access to
the static file.
One possible approach: if the generator writes its files to
foo/index.html.tmp (opened with exclusive access) and then renames
them to 'foo/index.html', then the deletion mechanism can attempt to
*first* remove the .tmp file, then the real file. Both processes
must be robust against their rename, unlink, or exclusive open()
calls failing, but there would then be no possibility of
collision. The exclusive open would have to be done at the *start*
of write processing, however, before any database queries have been
attempted. (And the writer's database connection must be rolled back
at that point.) This ensures that, if a writer succeeds in locking
the .tmp file, it is seeing data that is current.
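The exclusive-open-then-rename pattern might look roughly like this in
Python; `open_exclusive` and `publish` are hypothetical helper names,
and the O_CREAT|O_EXCL combination is what makes the open atomic:

```python
import os

def open_exclusive(tmp_path):
    """Open tmp_path for writing, failing if it already exists.

    O_EXCL together with O_CREAT makes creation atomic: exactly one
    process can create the file, so it doubles as a lock.
    """
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    return os.fdopen(fd, "w")

def publish(tmp_path, real_path, content):
    """Write content to the .tmp file, then rename it into place.

    On POSIX, rename() atomically replaces real_path, so readers see
    either the old page or the new one, never a partial write.
    """
    with open_exclusive(tmp_path) as f:
        f.write(content)
    os.rename(tmp_path, real_path)
```

A second writer calling open_exclusive() on the same .tmp path while
the first one holds it would get an OSError and could fall back to
serving the page dynamically.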
All that having been said, the idea in general sounds good. If PyPI
itself simply checked whether the URL it's about to serve is
cacheable (i.e., has a static location and no user logged in), and if
so, opened the temp file for exclusive writing, it could just dump
its generated page out, and rename it at the end if it had been
successful in acquiring the temp file.
And voila! No separate caching process, no scheduling, and an always
perfectly-up-to-date cache. As soon as a page becomes out of date,
it gets served dynamically... but only for as long as it takes to
serve one copy of that page. :)
In pseudocode:
def process_request():
    if no authentication header and URL path is cacheable:
        try:
            temp = exclusive open cache file with .tmp extension
        except os.error:
            pass
        else:
            with stdout redirected to temp:
                process_request_normally()
            try:
                rename(tempfilename, realfilename)
            except os.error:
                pass
            send_browser_contents_of(temp)
            return
    return process_request_normally()
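The pseudocode above might translate into runnable Python roughly as
follows; `process_request_normally`, `is_cacheable`, and the cache
layout are placeholders for illustration, not actual PyPI APIs:

```python
import os

def process_request(environ, cache_root, process_request_normally,
                    is_cacheable):
    """Sketch of the caching wrapper described above.

    process_request_normally(environ) is assumed to return the rendered
    page as a string (including its own database commit/rollback);
    is_cacheable(environ) decides whether the URL maps to a static
    location.  Both are hypothetical stand-ins.
    """
    if "HTTP_AUTHORIZATION" not in environ and is_cacheable(environ):
        real = os.path.join(cache_root,
                            environ["PATH_INFO"].lstrip("/"),
                            "index.html")
        tmp = real + ".tmp"
        try:
            # Acquire the lock by creating the .tmp file exclusively.
            fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
        except OSError:
            pass  # another writer holds the lock; serve dynamically
        else:
            page = process_request_normally(environ)
            with os.fdopen(fd, "w") as f:
                f.write(page)
            try:
                os.rename(tmp, real)
            except OSError:
                pass  # invalidated while we wrote; page stays dynamic
            return page
    return process_request_normally(environ)
```

Note that if the invalidator unlinks the .tmp file while we are
writing, the rename() simply fails and the freshly generated page is
still served to the requesting browser, just not cached.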
Here, 'process_request_normally()' should refer to everything that
PyPI does now, *including database connection rollback or
commit*. This will ensure that it's impossible to write stale data
to the cache.
The deletion process should just do this:
for name in (cache_path + '.tmp', cache_path):
    try:
        os.unlink(name)
    except os.error:
        pass
after committing the database transaction.
Informal serialization proof:
* Only one process may write to a page's .tmp file at a time
* Either the writer has committed its page write (by renaming the
.tmp file), or it has not (i.e., rename() is atomic)
* If the writer has *not* committed its page, then the first unlink
will prevent it from doing so.
* If the writer *has* committed its page, then the second unlink will
undo this.
* If, between the two unlink operations, another writer appears, that
writer will be reading current data from the database, because it has
to acquire exclusive access to the .tmp file before doing a rollback
and reading the data it will use for writing.
QED, it will be impossible to have stale data in the cache, unless
the invalidating request fails to attempt its two unlink operations
during the brief window after its database commit.
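The key step in the proof, that unlinking the .tmp file revokes an
uncommitted writer's right to publish, can be checked in a few lines
(temporary paths, nothing PyPI-specific):

```python
import os, tempfile

d = tempfile.mkdtemp()
tmp = os.path.join(d, "index.html.tmp")
real = os.path.join(d, "index.html")

# Writer acquires the lock by creating the .tmp file exclusively,
# then starts writing what will turn out to be a stale page.
fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
os.write(fd, b"stale page")

# Invalidator runs in between: its first unlink removes the .tmp file.
os.unlink(tmp)

# The writer's "commit" (the rename) now fails, so the stale page
# never reaches the cache.
try:
    os.rename(tmp, real)
    committed = True
except OSError:
    committed = False
os.close(fd)

assert not committed
assert not os.path.exists(real)
```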