I've just checked in a slew of changes affecting the archiver, hopefully fixing the more serious of the nasty performance problems we've been seeing. Jeremy Hylton deserves a lot of credit for his excellent analysis and patching of the code. He called his changes "a band-aid" because it's clear the archiver could still use more improvement, and in fact should go through a simplification and partial rewrite. We don't have the time for that, so for now we'll go with the Hylton Band-aid.
I hope I can accurately outline the changes, but it's late and I'm tired, so I might miss something. Please check the CVS log messages for details. I really hope some of you adventurous types will try running with these new changes. I intend to put them up on python.org (perhaps this weekend, or Monday) to see whether they fix the performance problems I've been seeing there.
First of all, Jeremy noticed that the way the archiver's .txt.gz file is created is very inefficient. It essentially reads the .txt file and uses Python's gzip module to write the .txt.gz file -- for /every/ message that gets posted! This is a lot of work for not much benefit. So the first change is that the .gz file is not created by default (see mm_cfg.GZIP_ARCHIVE_TXT_FILES). We need to add a crontab entry that gzips the file nightly. Yes, this means that the .txt.gz file will lag behind the on-line archive, but that should be a worthwhile tradeoff for better performance.
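For the curious, the nightly job could be as simple as the following sketch. This isn't the actual Mailman code -- the function name and path handling are hypothetical -- but it shows the idea: compress the whole .txt in one bulk pass instead of rewriting the .gz on every post.

```python
import gzip
import shutil

def gzip_archive(txt_path):
    """Compress txt_path to txt_path + '.gz' in a single bulk pass.

    Meant to be run from cron (e.g. nightly), replacing the old
    per-message regeneration of the .txt.gz file.
    """
    with open(txt_path, "rb") as src:
        with gzip.open(txt_path + ".gz", "wb") as dst:
            # shutil.copyfileobj streams in chunks, so even a large
            # archive file doesn't have to fit in memory at once.
            shutil.copyfileobj(src, dst)
```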
Jeremy also changed the DumbBTree implementation so that the keys are not sorted until some client starts to iterate through the data structure. This saves work when adding the initial elements. Remember that a while back I added a clear() method to do a bulk clear of the items, and that saved a lot of work too.
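Schematically, the lazy-sorting trick looks like this. This is a minimal sketch, not the actual DumbBTree code: inserts stay O(1), and the sort cost is paid once, only when a client actually walks the keys in order.

```python
class LazySortedDict:
    """Toy illustration of deferring key sorting until iteration."""

    def __init__(self):
        self._data = {}
        self._sorted_keys = None   # invalidated by every insert

    def __setitem__(self, key, value):
        self._data[key] = value
        self._sorted_keys = None   # don't re-sort now; wait for iteration

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        # Sort lazily, the first time someone iterates after a change.
        if self._sorted_keys is None:
            self._sorted_keys = sorted(self._data)
        return iter(self._sorted_keys)

    def clear(self):
        # Bulk clear, analogous to the clear() method mentioned above.
        self._data.clear()
        self._sorted_keys = None
```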
Finally, Jeremy made some observations about the cache in the HyperDatabase. He says that since the archiver traverses the elements in linear order, the lack of locality of reference means that evicting items from the cache doesn't really help, and in fact might hurt performance. So we now keep all the items in the cache (trading space for time). It might be worthwhile to get rid of the cache altogether, although it does serve a useful purpose currently. The DumbBTree is essentially a dictionary of pickles, and this whole structure is then marshaled. The cache holds onto the unpickled objects. It might make sense, then, to make the DumbBTree a simple dictionary and just pickle it directly. Then the cache wouldn't be needed.
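To make the contrast concrete, here's a toy sketch (the data is made up, and I'm using pickle for the outer container rather than marshal, purely for illustration). In the first layout, every value is its own pickle and must be unpickled on access, which is what the cache of unpickled objects is compensating for; in the second, one load restores live objects and no per-item cache is needed.

```python
import pickle

# Toy stand-in for archive data keyed by message id.
data = {"msgid-1": ["Subject: hi"], "msgid-2": ["Subject: re: hi"]}

# Layout 1 -- dictionary of pickles (roughly what DumbBTree amounts to):
# each value is stored pickled and must be unpickled on every access.
dict_of_pickles = {k: pickle.dumps(v) for k, v in data.items()}
article = pickle.loads(dict_of_pickles["msgid-1"])

# Layout 2 -- plain dictionary pickled directly: a single load brings
# back live objects, so there's nothing for a cache to hold onto.
blob = pickle.dumps(data)
restored = pickle.loads(blob)
```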
Jeremy has some other ideas about how to improve the archiver. I'm way too tired to outline them here. Jeremy will be out of the office for a week, so hopefully he'll be able to restore enough state when he gets back to post his ideas.
G'night, :) -Barry
Barry A. Warsaw