
On Wed, 2004-02-04 at 14:09, Les Niles wrote:
Our list server had another crash the other day, and this time it really toasted a couple of lists. :( (No, we hadn't yet done any of the mitigation steps that we should've, at least none that worked....)
What happens is that some of the config.pck files get trashed by having the last part of the file overwritten with NUL bytes. I'm assuming that filesystem corruption is causing this, perhaps involving disk hardware errors. By the time the problem is apparent, the config.pck and config.pck.last files are both trashed -- they're identical, with identical timestamps. I've looked at the logic for loading and saving config.pck, and don't see how this can happen. It seems like config.pck.last gets replaced only when the list data is saved, which should only happen at some point after the list data is successfully loaded. So there should be good data to generate the config.pck; otherwise config.pck.last should be left alone. But there seems to be some flaw in the logic that I can't see, because both files are ending up trashed. Then again, I've clearly demonstrated an overwhelming stupidity by letting these crashes happen many times until finally something really nasty occurred, so maybe I'm just too stupid to look at the code.
Or maybe something entirely different is happening. If the pickle-save itself is corrupted in a way that isn't being caught, then I suppose the bad config.pck will happily be turned into an equally bad config.pck.last.
Obviously I don't have a reproducible test case for this, but maybe someone has some idea of what's going on, and how to improve the robustness.
Have you tried setting SYNC_AFTER_WRITE=Yes in your mm_cfg.py file?
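
For illustration, here is a rough sketch of the kind of save sequence that a sync-after-write option is meant to protect. It is not Mailman's actual save code; the function name save_with_sync and its sync_after_write parameter are made up for this example. The idea is to write the new pickle to a temporary file, force it to disk with os.fsync(), check that it at least unpickles, and only then rotate the current file to .last and rename the new one into place. Note that the read-back check only catches a malformed pickle stream; it reads through the OS cache and so says nothing about what actually landed on the disk.

```python
import os
import pickle


def save_with_sync(data, path, sync_after_write=True):
    """Hypothetical helper: save `data` as a pickle at `path`, keeping `path + '.last'`."""
    tmp = path + ".tmp"
    last = path + ".last"

    # Write the new state to a temporary file first, never over the live file.
    with open(tmp, "wb") as fp:
        pickle.dump(data, fp, protocol=pickle.HIGHEST_PROTOCOL)
        fp.flush()
        if sync_after_write:
            # Ask the OS to push the file contents to stable storage
            # before we start renaming anything.
            os.fsync(fp.fileno())

    # Sanity check: refuse to rotate if the new file does not unpickle.
    with open(tmp, "rb") as fp:
        pickle.load(fp)

    # Keep the previous good copy as .last via a hard link, so the live
    # file never disappears, then atomically move the new file into place.
    if os.path.exists(path):
        if os.path.exists(last):
            os.unlink(last)
        os.link(path, last)
    os.replace(tmp, path)
```

Without the fsync step, the renames can reach the disk before the file contents do, so a crash at the wrong moment can leave a correctly named file whose tail was never written.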
-Barry