Re: [Mailman-Users] The Mysterious Disappearing Disk Space (fwd)

Nov. 20, 2008


      J.A. Terranson wrote:
...
On Tue, 18 Nov 2008, Mark Sapiro wrote:
...
J.A. Terranson wrote:
...
I have looked through the archives for something similar to my
issue, and I noticed that by searching on "disk full", I get similar
reports beginning in roughly July of 08.
As with these other reports, I have noticed *tremendous*
disappearing space.  When I tried to find the actual files, I was
unsuccessful.  Interestingly, if I stop mailman and then restart it, the
"missing" space miraculously reappears!
Is this Solaris? If so, see the thread beginning at
<http://mail.python.org/pipermail/mailman-users/2008-July/062359.html>
which is about an alleged memory leak.
FreeBSD 6.3, and the issue in the above thread doesn't look like the same
thing.
...
If you're running out of disk, and restarting the processes solves it,
it may be swap space that's eating up the disk.
I considered this, however swap is simply not being used at any time.  I
put up a cron job to monitor, and swap use is literally zero all the way
up to the crash for lack of space.  Honestly, it *feels* like some huge
log file somewhere, but I can find no viable explanation for it.  This is
a single partitioned box, with a single dedicated user (running 4 mailman
lists and a few websites).  Using du I see space being eaten, but no
indication as to where.   I see /usr and /var "growing", but looking into
them shows no file(s) that could account for the amount of missing space.
...
...
So, now that the background is over with, here's where I find
myself (besides just looking stupid):
(1) Yesterday I enabled VERP, and it appeared to be working well.  At the
time I turned on VERP, I had around 5gb of free space (which would take
about two weeks to "disappear" before VERP).
5gb of free disk space doesn't seem like a lot these days.
Agreed.  But then, they aren't doing much either.
...
...
(2) Around 2pm today, the disk was full, and mailman died.
Enabling VERP might cause the MTA to use a lot more queue space, but I
don't see that it would affect Mailman much.
The only difference was in the rate at which the "loss" accrued.  Roughly
a 6x increase.
...
...
(3) My inkling of something being wrong was this on the web interface:
"Bug in Mailman version 2.1.11rc2
We're sorry, we hit a bug!
Please inform the webmaster for this site of this problem. Printing of
traceback and other system information has been explicitly inhibited,
but the webmaster can find this information in the Mailman error logs. "
(4) Upon looking at the system in response to the above missive, I checked
and saw the system was out of space again.  I did what I always do - shut
down mailman (which usually drops ~5gb of "missing" space), and then
restart it.  Everything before today has come up roses doing this.
So are you saying that this time you didn't recover any disk space or
just that the web error didn't go away? If the latter, it seems likely
that the disk space error caused a config.pck file to be corrupted and
that is the cause of the recurrent "bug". What is the traceback from
the most recent of these from the error log?
I apologize for the lack of clarity.  I'm saying that the space did come
back, as always, but this time was unique in throwing up this web message.

All of the mailman core functionality appeared to be running normally
(lots of traffic back and forth), but the web UI was dead.
...
...
(5) There is nothing in any of the logs that indicates why this message is
continuing to poke fun at me.
There almost certainly is something in Mailman's error log unless the
logfile just can't be grown to accommodate the message.
No, there really isn't.  I have combed through all of them (bounce, error,
mischief, post, qrunner, smtp & smtp-failure, subscribe and vette.  Did I miss
anything?), with no sign of anything being wrong.
I have several other mailman systems, and I have always seen a traceback
or slew of messages when something went south, but nothing here.  Also, of
note, this is the only mailman with the disappearing disk issue.  My other
boxen are all running *really* old versions, and this new customer build
is doubling as my canary:  so far, I see BIG improvements in throughput,
but this disk thing has me crazy.  If I stop mailman when the drive hits
99%, I instantly get my 5gb back.  It feels like I'm writing a file that I
cannot see, but I don't think this is physically possible (anyone know
otherwise?).
Yes, this is very possible:
1. open a file.
2. write data to it.
3. delete it.
If the file is not closed, the space will still be in use, but there
won't be any entry in the parent directory for it.  You can test for
this by cd-ing to the base of the file system which is running out of
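space.  Run "du -dks .", then "df -k .".  (The "-d" here is the Solaris
flag that keeps du on one file system; FreeBSD's du spells that "-x", so
use "du -xks ." there.)  The two usage numbers should be the same, within
a few k.  If they differ, then the used space is not reflected in any
directory.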
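
The three steps above can be reproduced in a few lines of Python (the
language Mailman itself is written in).  This is only a sketch for a
local POSIX file system; the names tree_bytes, demo_unlinked_file and
hidden.log are made up for the illustration and have nothing to do with
Mailman.  It shows why a du-style walk stops seeing the data while the
file system is still holding it.

```python
import os
import tempfile


def tree_bytes(root):
    """Sum the sizes of all files reachable by name under root (what du can see)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # File vanished between listing and stat; ignore it.
                pass
    return total


def demo_unlinked_file(payload_size=1024 * 1024):
    """Open, write, and delete a file, then report what is still held open."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "hidden.log")  # hypothetical file name
        handle = open(path, "wb")                   # 1. open a file
        try:
            handle.write(b"x" * payload_size)       # 2. write data to it
            handle.flush()
            visible_before = tree_bytes(workdir)
            os.unlink(path)                         # 3. delete it, still open
            visible_after = tree_bytes(workdir)
            info = os.fstat(handle.fileno())
            return {
                "visible_before": visible_before,   # bytes a directory walk finds
                "visible_after": visible_after,     # bytes found after the unlink
                "link_count": info.st_nlink,        # directory entries remaining
                "size_still_held": info.st_size,    # bytes the open handle still owns
            }
        finally:
            handle.close()                          # only now is the space released
```

Calling demo_unlinked_file() returns the byte counts seen by the
directory walk before and after the unlink, plus the link count and size
that fstat reports for the still-open handle.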
If this is the case, you may be able to find out which process has the
open, unlinked file using "lsof".  Run it as "lsof -s -p PID" once for
each Mailman process.  For the offender, lsof should report open files
for which it either can't resolve the name or shows a name that no longer
exists.  The flag "-s" tells it to report the size, which may help
identify a large file.  The ability of lsof to report the names of
open files may vary by OS, however.
Rereading the man page for lsof, I just noticed the "+L" option.
Using "+aL1" (that is plus aye ell one) causes it to select unlinked
open files.  Perhaps this will help.
I hope this will help ID which process, at least.  Perhaps that will
give clues.
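
For anyone who wants to run the two lsof invocations above from a
script (say, from cron), here is a rough Python sketch.  The function
names are invented for the example, and it assumes lsof is on the PATH
and Python 3.7 or later is available; it only builds and runs the
commands quoted in this message and does not try to parse lsof's output,
which differs between platforms.

```python
import shutil
import subprocess


def per_process_command(pid):
    """Build the "lsof -s -p PID" command for one process ID."""
    return ["lsof", "-s", "-p", str(int(pid))]


def unlinked_files_command(filesystem):
    """Build the "lsof +aL1 <file system>" command for unlinked open files."""
    return ["lsof", "+aL1", filesystem]


def run_lsof(command):
    """Run an lsof command; return its text output, or None if it is not installed."""
    if shutil.which(command[0]) is None:
        return None
    # lsof exits non-zero when nothing matches, so that is not treated as an error.
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return result.stdout
```

For example, run_lsof(unlinked_files_command("/")) would cover a
single-partition box like the one described in this thread, and
run_lsof(per_process_command(pid)) can be repeated for each Mailman
qrunner PID.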
...
I spent a few hours mucking around with the pickles trying to figure what
broke, and finally gave up due to screaming users: I rebuilt.  The new
build acts *just* like the last one (the reason for the delay in answering
your kind reply was to see if the rebuild would get rid of this).  I've
lost about a gig over 24 hours, and I have NO idea where it's going.  I
stopped the job while writing this paragraph just to double check, and
yes, I get it all back when the job is terminated.  Very odd indeed.
I'm not comfy with debuggers, so I'm at the mercy of others.
Have I missed any log files?  Is there somewhere specific I should be
looking?  Is there some way to (easily) increase logging details to try
and track this down?
The answers to this and other important questions await.  On the next
episode of MailSoap. <cue jingle>
Seriously though, I appreciate your response, and the time spent on this.
All the best,
//Alif
--
Gary Algier, WB2FWZ          gaa at ulticom.com         +1 856 787 2758
Ulticom Inc., 1020 Briggs Rd, Mt. Laurel, NJ 08054  Fax:+1 856 866 2033
Nielsen's First Law of Computer Manuals:
People don't read documentation voluntarily.