Greetings,
We had a series of longer-than-battery outages on Wednesday, and
when we came back online, Mailman refused to run. Here are the messages:
mailman# tail -f error
Oct 30 02:28:06 2010 (11862) couldn't load config file /usr/mailman/lists/[listname]/config.pck.last 1778451844
Oct 30 02:28:06 2010 (11862) couldn't load config file /usr/mailman/lists/[listname]/config.db [Errno 2] No such file or directory: '/usr/mailman/lists/[listname]/config.db'
Oct 30 02:28:06 2010 (11862) couldn't load config file /usr/mailman/lists/[listname]/config.db.last [Errno 2] No such file or directory: '/usr/mailman/lists/[listname]/config.db.last'
Oct 30 02:28:06 2010 (11862) All [listname]B fallbacks were corrupt, giving up
Oct 30 02:28:06 2010 (11862) error opening list: [listname] [Errno 2] No such file or directory: '/usr/mailman/lists/[listname]/config.db.last'
Please tell me there's a reasonable recovery for this on 2.1.21 (plus a patch)?
Thanks!
//Alif
J.A. Terranson wrote:
We had a series of longer-than-battery outages on Wednesday, and when we came back online, Mailman refused to run. Here are the messages:
mailman# tail -f error
Oct 30 02:28:06 2010 (11862) couldn't load config file /usr/mailman/lists/[listname]/config.pck.last 1778451844
Oct 30 02:28:06 2010 (11862) couldn't load config file /usr/mailman/lists/[listname]/config.db [Errno 2] No such file or directory: '/usr/mailman/lists/[listname]/config.db'
Oct 30 02:28:06 2010 (11862) couldn't load config file /usr/mailman/lists/[listname]/config.db.last [Errno 2] No such file or directory: '/usr/mailman/lists/[listname]/config.db.last'
Oct 30 02:28:06 2010 (11862) All [listname]B fallbacks were corrupt, giving up
Oct 30 02:28:06 2010 (11862) error opening list: [listname] [Errno 2] No such file or directory: '/usr/mailman/lists/[listname]/config.db.last'
Please tell me there's a reasonable recovery for this on 2.1.21 (plus a patch)?
2.1.12?
The recovery is to restore the corrupt /usr/mailman/lists/[listname]/config.pck from the most recent good backup.
Mailman does the best it can by trying to first write config.pck.tmp.<hostname>.<pid> and then removing config.pck.last, moving config.pck to config.pck.last and finally moving config.pck.tmp.<hostname>.<pid> to config.pck.
You could check for a /usr/mailman/lists/[listname]/config.pck.tmp.<hostname>.<pid> file and try moving that to /usr/mailman/lists/[listname]/config.pck if it exists, but even if it does, it may be bad too.
It appears that your system is caching disk writes to the extent that both the config.pck and config.pck.last were incompletely written when the power failed. You might look into that, and also consider setting
SYNC_AFTER_WRITE = Yes
in mm_cfg.py (see the documentation of this in Defaults.py).
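For illustration, the write sequence described above might look roughly like this in Python. This is a minimal sketch and not Mailman's actual code: the function name save_with_fallback and its sync parameter are hypothetical, with sync standing in for the effect of SYNC_AFTER_WRITE. It is written for Python 3's pickle module, whereas Mailman 2.1 itself uses cPickle under Python 2.

```python
import os
import pickle
import socket


def save_with_fallback(data, path, sync=True):
    """Write data to path, keeping the previous copy as path + '.last'."""
    tmp = '%s.tmp.%s.%d' % (path, socket.gethostname(), os.getpid())
    last = path + '.last'

    # 1. Write the new pickle to a temporary file first.
    with open(tmp, 'wb') as fp:
        pickle.dump(data, fp, 1)      # protocol 1, like config.pck
        fp.flush()                    # empty Python's own buffer
        if sync:
            os.fsync(fp.fileno())     # ask the OS to push it to disk

    # 2. Remove the old fallback copy, if there is one.
    if os.path.exists(last):
        os.unlink(last)
    # 3. Demote the current file to be the new fallback.
    if os.path.exists(path):
        os.rename(path, last)
    # 4. Promote the freshly written file.
    os.rename(tmp, path)
```

Without the fsync step, all three files can sit in the write cache at once, which is how both config.pck and config.pck.last can end up incomplete after a power failure.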
On Sat, 30 Oct 2010, Mark Sapiro wrote:
You could check for a /usr/mailman/lists/[listname]/config.pck.tmp.<hostname>.<pid> file and try moving that to /usr/mailman/lists/[listname]/config.pck if it exists, but even if it does, it may be bad too.
Thanks for the terrible news :-/ Is my [completely off the cuff] understanding of the config file correct that it just holds the settings from the config pages? If so, is there any reason I cannot "create" a new list (after moving the existing one) with the same name and then sub the new config.pck for the old corrupted one?
Thanks for the flush information: even though this was a unique situation (every box in the rack was damaged, several beyond repair [including the backup server]), living in the middle of the tornado capital of the world means it could realistically happen again, so that's a really good one to know.
//Alif
J.A. Terranson wrote:
You could check for a /usr/mailman/lists/[listname]/config.pck.tmp.<hostname>.<pid> file and try moving that to /usr/mailman/lists/[listname]/config.pck if it exists, but even if it does, it may be bad too.
Thanks for the terrible news :-/ Is my [completely off the cuff] understanding of the config file correct that it just holds the settings from the config pages? If so, is there any reason I cannot "create" a new list (after moving the existing one) with the same name and then sub the new config.pck for the old corrupted one?
Assuming you don't use some kind of custom member adaptor, the config.pck also contains all the list membership information.
You can try running 'strings' on the various config.pck* files to see if you can extract useful information that way.
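If 'strings' output is too noisy, a structured variant of the same idea is to walk the pickle's opcodes and keep only string arguments that look like addresses. The following is a rough sketch under some assumptions: it targets Python 3's standard pickletools module, the name salvage_addresses is hypothetical, and the address pattern is a loose guess rather than Mailman's own validation. Unlike 'strings', it stops at the first byte it cannot parse, so it recovers only what precedes the damage.

```python
import pickletools
import re

# Loose "looks like an address" pattern; an assumption for illustration only.
ADDRESS_RE = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')


def salvage_addresses(path):
    """Return sorted, lowercased address-like strings found in a pickle file."""
    found = set()
    with open(path, 'rb') as fp:
        try:
            # genops parses opcodes without building any objects,
            # so it does not need Mailman's classes to be importable.
            for opcode, arg, pos in pickletools.genops(fp):
                if isinstance(arg, bytes):
                    arg = arg.decode('latin-1')
                if isinstance(arg, str) and ADDRESS_RE.match(arg):
                    found.add(arg.lower())
        except Exception:
            pass  # reached the damaged region; keep what was read so far
    return sorted(found)
```

The result could then be fed to a freshly created list, although per-member options would still need to be restored separately.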
On Sat, 30 Oct 2010, Mark Sapiro wrote:
J.A. Terranson wrote:
You could check for a /usr/mailman/lists/[listname]/config.pck.tmp.<hostname>.<pid> file and try moving that to /usr/mailman/lists/[listname]/config.pck if it exists, but even if it does, it may be bad too.
Thanks for the terrible news :-/ Is my [completely off the cuff] understanding of the config file correct that it just holds the settings from the config pages? If so, is there any reason I cannot "create" a new list (after moving the existing one) with the same name and then sub the new config.pck for the old corrupted one?
Assuming you don't use some kind of custom member adaptor, the config.pck also contains all the list membership information.
You can try running 'strings' on the various config.pck* files to see if you can extract useful information that way.
Everything seems to be in there: I see all the settings, plus it looks like all the subscribers are there as well. It's going to be *really* painful manually restoring all ~1200 addresses and their information, but I suppose it will have to do.
What's really odd is that only one (of approx 30 lists) was corrupted, and when I compare the output of the corrupted vs known good files, they look roughly identical. Clearly they aren't, but... Is there any way to tell "where" Mailman thinks the corruption begins, or is it just the absence of a clean flag somewhere that I am hosed on?
//Alif
J.A. Terranson wrote:
On Sat, 30 Oct 2010, Mark Sapiro wrote:
You can try running 'strings' on the various config.pck* files to see if you can extract useful information that way.
Everything seems to be in there: I see all the settings, plus it looks like all the subscribers are there as well. It's going to be *really* painful manually restoring all ~1200 addresses and their information, but I suppose it will have to do.
What's really odd is that only one (of approx 30 lists) was corrupted, and when I compare the output of the corrupted vs known good files, they look roughly identical.
Probably that one list was the only one that was being or had 'recently' been updated when the power was removed.
Clearly they aren't, but... Is there any way to tell "where" Mailman thinks the corruption begins, or is it just the absence of a clean flag somewhere that I am hosed on?
Mailman doesn't have a clue as to what the problem is. The file is a Python pickle and all Mailman knows is its attempt to cPickle.load() the file threw an exception or didn't return a Python dictionary.
See http://docs.python.org/library/pickle.html.
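As a rough illustration of that check, the sketch below tries each candidate file in turn and accepts the first one that unpickles to a dictionary. It is not Mailman's code: the name first_loadable is hypothetical, and it uses Python 3's pickle module where Mailman 2.1 uses cPickle. A real config.pck may also need Mailman's modules to be importable before it will load.

```python
import pickle


def first_loadable(paths):
    """Return (path, dict, errors) for the first file that loads as a dict."""
    errors = {}
    for path in paths:
        try:
            with open(path, 'rb') as fp:
                obj = pickle.load(fp)
        except Exception as exc:
            # Missing file, truncated data, bad opcode, and so on.
            errors[path] = exc
            continue
        if isinstance(obj, dict):
            return path, obj, errors
        # Loaded cleanly, but it is not the expected dictionary.
        errors[path] = TypeError('loaded a %s, not a dict' % type(obj).__name__)
    return None, None, errors
```

The errors mapping shows why each earlier candidate was rejected, which is more than the "couldn't load config file" log lines report.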
Also, you may be able to use some of the pickletools functions to help determine what might be wrong with the file and how to fix it. See http://docs.python.org/library/pickletools.html. Also, see the text at the beginning of the /usr/lib/pythonx.x/pickletools.py file for documentation of the pickle file format. Note that config.pck is a "protocol 1" pickle.
The first step would be to run the following Python commands. It is best to run them under withlist as the unpickling process may need access to Mailman.
$ /usr/mailman/bin/withlist -i
(some output)
>>> import cPickle
>>> fp = open('/usr/mailman/lists/[listname]/config.pck')
>>> xxx = cPickle.load(fp)
(some output about the error)
(enter control-D here to exit)
Then, depending on the error, you might be able to use pickletools or just a dump of the file to see what the problem is.
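One way pickletools might help here is to report how far into the file the opcode stream can be parsed. The sketch below assumes Python 3's standard pickletools module, and the name last_good_offset is hypothetical. It only checks opcode-level syntax, so a file can pass this test and still fail to load; pickletools.dis() performs additional consistency checks and prints a full disassembly.

```python
import pickletools


def last_good_offset(path):
    """Return (offset, error).

    offset is the byte position of the last opcode that parsed cleanly
    (None if not even the first one did); error is the exception raised
    at the point of damage, or None if the STOP opcode was reached.
    """
    last = None
    with open(path, 'rb') as fp:
        try:
            for opcode, arg, pos in pickletools.genops(fp):
                last = pos
        except Exception as exc:
            return last, exc
    return last, None
```

Comparing the reported offset with a hex dump of the file around that position should show where the readable data ends.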