[Mailman-Developers] Scrubber.py confusion, 2.1b3

Michael Meltzer mjm@michaelmeltzer.com
Tue, 13 Aug 2002 01:00:39 -0400

Actually, I am "reusing" the code from Scrubber.py in MimeDel.py to turn
attachments into links :-) I hardwired it for image types, but it is generic
enough. Some sample output from my "staging":

Name: beach.jpg Type: image/jpeg Size: 18853 bytes Desc: not_available

It turned out to be a 4-line hack to filter_parts, 1 line at the top, and 10
lines to reformat the payload; the rest came from save_attachment, very
handy :-) I have to admit the environment is nice to work in. I am not sure my
code is up to patch quality :-) The next step would be a modification to the
content filter page for the types it should react to.
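
A rough sketch of what the payload-reformatting step might look like is
below. It is not the actual MimeDel.py change: describe_attachment is a
hypothetical helper, the Url line and the example.invalid address are
assumptions, and it is written against the current stdlib email package
rather than the one shipped with 2.1b3.

```python
from email.mime.image import MIMEImage


def describe_attachment(part, url):
    """Build the text that would stand in for an attachment.

    The Name/Type/Size/Desc lines follow the sample output above;
    the Url line is an assumption about where the link would go.
    """
    data = part.get_payload(decode=True) or b""
    lines = [
        "Name: %s" % (part.get_filename() or "not_available"),
        "Type: %s" % part.get_content_type(),
        "Size: %d bytes" % len(data),
        "Desc: %s" % part.get("Content-Description", "not_available"),
        "Url: %s" % url,
    ]
    return "\n".join(lines)


# Build a small fake image part to exercise the helper.
part = MIMEImage(b"\xff\xd8" + b"\x00" * 100, _subtype="jpeg")
part.add_header("Content-Disposition", "attachment", filename="beach.jpg")
summary = describe_attachment(
    part, "http://example.invalid/attachments/beach.jpg")
print(summary)
```

The summary string would then replace the part's payload once the real file
has been written out by save_attachment.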

I would also suggest (Scrubber.py needs this too) that the filter pages list
the extensions that it is allowed to write, or conversely the extensions it
should not write. The list at
http://office.microsoft.com/Assistance/2000/Out2ksecFAQ.aspx would be my
start :-), save the masses someday :-)
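
As a minimal sketch of such a check, assuming the lists would eventually
come from the filter page: may_write and both extension sets below are
hypothetical, and the blocked entries are illustrative examples rather than
a copy of the list at that URL.

```python
import os

# Hypothetical lists; real ones would be configured per list.
ALLOWED_EXTENSIONS = frozenset([".jpg", ".jpeg", ".png", ".gif"])
BLOCKED_EXTENSIONS = frozenset([".exe", ".bat", ".vbs", ".scr", ".pif"])


def may_write(filename, allowed=None, blocked=BLOCKED_EXTENSIONS):
    """Return True if an attachment with this name may be written to disk.

    A blocked extension always loses. If an allow list is given, the
    extension must also appear in it; with no allow list, anything that
    is not blocked passes.
    """
    # Only the final extension counts, so "photo.jpg.exe" is ".exe".
    ext = os.path.splitext(filename)[1].lower()
    if ext in blocked:
        return False
    if allowed is not None:
        return ext in allowed
    return True
```

Checking only the last extension means a double-extension name such as
photo.jpg.exe is treated as an .exe, which is the case the deny list is
meant to catch.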

The issue with the directory is the number of files, not a name clash. Once
`ls -d archives/private/listname/attachments/* | wc -l` goes above 1000, I
think system performance will be affected. Above 10,000 I know it would (it
would also be a problem for the http server on access). I can understand
keeping the attachments from each email in their own directory, but this way
the "files version control" :-) groups them together for access (assuming a
least-recently-used theory) and makes cleaning out for space/inodes simple.
It was just strftime welded on.
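
A minimal sketch of the strftime idea, using the attachments/YYYYMM/DD/
layout from the quoted message below. attachment_dir is a hypothetical
helper, not the function in Scrubber.py, and the choice of UTC is an
assumption.

```python
import calendar
import os
import time


def attachment_dir(base, when=None):
    """Return base/attachments/YYYYMM/DD for a time in epoch seconds.

    when=None means "now". UTC is used so the path does not depend on
    the server's local timezone.
    """
    t = time.gmtime(when)
    return os.path.join(base, "attachments",
                        time.strftime("%Y%m", t),
                        time.strftime("%d", t))


# 13 Aug 2002 05:00:39 UTC, the timestamp of this message.
stamp = calendar.timegm((2002, 8, 13, 5, 0, 39))
example = attachment_dir("archives/private/listname", stamp)
print(example)
```

With this layout, clearing out old attachments amounts to removing whole
month or day directories.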


----- Original Message -----
From: "Barry A. Warsaw" <barry@python.org>
To: "Michael Meltzer" <mjm@michaelmeltzer.com>
Cc: <Mailman-Developers@python.org>
Sent: Monday, August 12, 2002 8:00 PM
Subject: Re: [Mailman-Developers] Scrubber.py confusion, 2.1b3

> >>>>> "MM" == Michael Meltzer <mjm@michaelmeltzer.com> writes:
>     MM> I have been going over some of the Scrubber.py code; two things
>     MM> are standing out for me
> Cool, someone's looking at it :)
>     MM> 1) A lot of work was done to make the filename unique in
>     MM> "save_attachment"; it looks like a straight bug that the url
>     MM> returned does not have the "extra" part as part of
>     MM> the url. Looks to me like the last line should be
>     MM> url = baseurl + 'attachments/%s/%s' % (msgdir, filename +
>     MM> extra)
> It's certainly true that extra is never used once calculated.  That
> can't be useful. :)
>     MM> frankly I think the forming of the name could be better, like
>     MM> filenamebase + "-" + counter + "." + ext, but that is more of a
>     MM> feature request
> That was the intent, but the code's broken.
>     MM> 2) It looks like this code is doing directory abuse: it looks
>     MM> like an unlimited number of file names will be placed in one
>     MM> directory, like 2^32. This is not good for system
>     MM> performance; even with the latest dirhash methods in the
>     MM> operating system, this will become a linear screech very
>     MM> quickly for file creates and file exists. Been there and
>     MM> killed the patient that way. Hard to spot it until you ramp
>     MM> the systems up. I am playing around by adding two more
>     MM> time-based directories to the system "attachments/YYYYMM/DD/".
>     MM> BTW that is what made spotting bug #1 so easy :-)
> I agree that the directory calculation is broken.  It actually looks
> like a message with two attachments will end up in two different
> subdirs in archives/private/listname/attachments.  That wasn't the
> intent.  The idea was that each message would have a separate subdir
> in attachments and all its attachments would end up there.  So you'd
> only be in trouble on very high volume lists.  2**32 at 1000 msgs /
> day gives you about 11k years of running room.  If you were paranoid
> about 2**16 directories, then you might care about adding another
> level of directories.
> I'll work on fixing the code, and see how easy it is to add or change
> to the date-based directory.
> Thanks,
> -Barry
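
For the naming scheme discussed in the quoted message (filenamebase + "-" +
counter + "." + ext), a minimal sketch might look like the following.
unique_name and attachment_url are hypothetical helpers, not the code in
save_attachment; the point is only that the URL gets built from the final,
de-duplicated name.

```python
import os


def unique_name(directory, filename, exists=os.path.exists):
    """Return filename, or base-N.ext if that name is already taken.

    exists is injectable so the logic can be tried without touching disk.
    """
    base, ext = os.path.splitext(filename)
    candidate = filename
    counter = 0
    while exists(os.path.join(directory, candidate)):
        counter += 1
        candidate = "%s-%d%s" % (base, counter, ext)
    return candidate


def attachment_url(baseurl, msgdir, filename):
    """Build the URL from the name the file was actually saved under."""
    return baseurl + "attachments/%s/%s" % (msgdir, filename)
```

The check-then-create loop is not safe if two processes save into the same
directory at once; a real version would need to create the file exclusively
or hold a lock.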