Is the config.pck.last logic correct?

Our list server had another crash the other day, this time it really toasted a couple of lists. :( (No, we hadn't yet done any of the mitigation steps that we should've, at least none that worked....)
What happens is that some of the config.pck files get trashed by having the last part of the file overwritten with nul bytes. I'm assuming that it's a filesystem corruption causing this, perhaps involving disk hardware errors.

By the time the problem is apparent, the config.pck and config.pck.last files are both trashed -- they're identical, with identical timestamps. I've looked at the logic for loading and saving config.pck, and don't see how this can happen. It seems like config.pck.last gets replaced only when the list data is saved, which should only happen at some point after the list data is successfully loaded. So there should be good data to generate the config.pck, otherwise config.pck.last should be left alone.

But there seems to be some flaw in the logic, that I can't see, because both files are ending up trashed. Then again, I've clearly demonstrated an overwhelming stupidity by letting these crashes happen many times until finally something really nasty occurred, so maybe I'm just too stupid to look at the code.
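For readers who haven't looked at that code, the sequence being described is roughly the following. This is only a sketch of the write-temp-then-rotate pattern in the spirit of Mailman's MailList.Save(), not the actual code; the save_config helper is made up for illustration.

    import os
    import cPickle

    def save_config(listdir, state):
        # Sketch of the save-and-rotate sequence described above; an
        # illustration of the pattern, not Mailman's actual Save() code.
        pck = os.path.join(listdir, 'config.pck')
        last = pck + '.last'
        tmp = '%s.tmp.%d' % (pck, os.getpid())
        fp = open(tmp, 'wb')
        try:
            cPickle.dump(state, fp, 1)
            fp.flush()
            # os.fsync(fp.fileno())   # this is what SYNC_AFTER_WRITE enables
        finally:
            fp.close()
        # Keep the current (presumed good) pickle as config.pck.last,
        # then promote the new one.  If we never get this far -- e.g. the
        # load failed -- config.pck.last is never touched.
        if os.path.exists(pck):
            try:
                os.unlink(last)
            except OSError:
                pass
            os.link(pck, last)
        os.rename(tmp, pck)

The point of the rotation is that config.pck.last should only ever be replaced by a pickle that was, at some point, the live config.pck.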
Or maybe something entirely different is happening. If the pickle-save itself is corrupted in a way that isn't being caught, then I suppose the bad config.pck will happily be turned into an equally bad config.pck.last.
Obviously I don't have a reproducible test case for this, but maybe someone has some idea of what's going on, and how to improve the robustness.
-les
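One robustness tweak along the lines Les is asking about -- purely a hypothetical sketch, not something Mailman 2.1 actually does -- is to re-read the freshly written temp file and refuse to promote it over config.pck (and hence over config.pck.last) unless it unpickles cleanly:

    import cPickle

    def pickle_is_loadable(path):
        # Hypothetical guard, not part of Mailman: only promote a freshly
        # written temp pickle if it can actually be read back.
        fp = open(path, 'rb')
        try:
            try:
                cPickle.load(fp)
                return True
            except Exception:
                return False
        finally:
            fp.close()

In the sketch above, this check would sit between closing the temp file and the link/rename steps. It only catches corruption that is visible through the filesystem cache at write time, not data that goes bad later on its way to the disk.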

On Wed, 2004-02-04 at 14:09, Les Niles wrote:
Our list server had another crash the other day, this time it really toasted a couple of lists. :( (No, we hadn't yet done any of the mitigation steps that we should've, at least none that worked....) ... Obviously I don't have a reproducible test case for this, but maybe someone has some idea of what's going on, and how to improve the robustness.
Have you tried setting SYNC_AFTER_WRITE=Yes in your mm_cfg.py file?
-Barry
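For anyone following along, enabling that is a one-line override in mm_cfg.py; SYNC_AFTER_WRITE is an existing Mailman 2.1 knob, and Yes/No come in via the usual import from Defaults.py:

    # mm_cfg.py -- site-specific overrides of Mailman's Defaults.py
    from Defaults import *

    # fsync() the new config.pck data out to disk after each write,
    # before the file is rotated into place.
    SYNC_AFTER_WRITE = Yes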

On Sun, 08 Feb 2004 13:59:06 -0500 Barry Warsaw <barry@python.org> wrote:
On Wed, 2004-02-04 at 14:09, Les Niles wrote:
Our list server had another crash the other day, this time it really toasted a couple of lists. :( (No, we hadn't yet done any of the mitigation steps that we should've, at least none that worked....) ... Obviously I don't have a reproducible test case for this, but maybe someone has some idea of what's going on, and how to improve the robustness.
Have you tried setting SYNC_AFTER_WRITE=Yes in your mm_cfg.py file?
-Barry
No, not yet. That's one of the mitigation steps the lack of which demonstrates why I should be kept away from computers. But whether the writes succeed or fail or trash the file, I couldn't see how both config.pck and config.pck.last got corrupted. If it's some subtle bug in the program's logic, that might be worth fixing, but if it's just some serious nastiness on the part of the filesystem then never mind.
BTW, I don't think I mentioned before, this is mailman 2.1.3 on FreeBSD 4.9-stable with a UFS filesystem.
-les
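As an aside, a quick way to find lists that have already been hit is just to try unpickling each file. This is a hypothetical standalone checker, not part of Mailman, and the lists directory path is an assumption to adjust for the actual install:

    #!/usr/bin/env python
    # Hypothetical checker: report any config.pck or config.pck.last
    # that no longer unpickles (e.g. because its tail was nul-padded).
    import os
    import cPickle

    LIST_DIR = '/usr/local/mailman/lists'   # assumption; adjust as needed

    names = os.listdir(LIST_DIR)
    names.sort()
    for name in names:
        for base in ('config.pck', 'config.pck.last'):
            path = os.path.join(LIST_DIR, name, base)
            if not os.path.exists(path):
                continue
            fp = open(path, 'rb')
            try:
                try:
                    cPickle.load(fp)
                except Exception, e:
                    print '%s: %s' % (path, e)
            finally:
                fp.close()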

On Sun, 2004-02-08 at 14:10, Les Niles wrote:
On Sun, 08 Feb 2004 13:59:06 -0500 Barry Warsaw <barry@python.org> wrote:
Have you tried setting SYNC_AFTER_WRITE=Yes in your mm_cfg.py file?
-Barry
No, not yet.
I'd like to get a sense from folks as to whether I should turn on SYNC_AFTER_WRITE for MM 2.1.5. If so, should I keep the configuration variable or just hard-code enable it? Note that with the pending.pck and requests.pck rewrite I recently implemented, I'm always fsync'ing the files.
That's one of the mitigation steps the lack of which demonstrates why I should be kept away from computers. But whether the writes succeed or fail or trash the file, I couldn't see how both config.pck and config.pck.last got corrupted. If it's some subtle bug in the program's logic, that might be worth fixing, but if it's just some serious nastiness on the part of the filesystem then never mind.
The only way I can see this happening is if the system calls succeed, but the data gets corrupted before it's flushed out to disk. So the writes and closes of the tmp files never raise exceptions, the rename dance is done, and then you're left with a corrupt .last file. If for some reason this is happening, turning on fsync should expose the problem because, presumably, that call won't succeed unless the data is flushed to disk.
Try setting SYNC_AFTER_WRITE to Yes and restarting Mailman.
-Barry
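For concreteness, what turning that on buys is the standard flush-then-fsync idiom around the pickle write -- a generic sketch, not a quote of Mailman's code:

    import os
    import cPickle

    def dump_synced(obj, path):
        # Generic flush-then-fsync idiom: push Python's buffers to the OS,
        # then ask the OS to push the data to disk, before the caller goes
        # on to rename the file into place.
        fp = open(path, 'wb')
        try:
            cPickle.dump(obj, fp, 1)
            fp.flush()
            os.fsync(fp.fileno())
        finally:
            fp.close()

If the data really is getting mangled below the write() call, the fsync() is where an error has a chance to surface instead of silently landing in both files.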
participants (2)
- Barry Warsaw
- Les Niles