[Mailman-Developers] Scrubber.py confusion, 2.1b3

Michael Meltzer mjm@michaelmeltzer.com
Wed, 14 Aug 2002 05:02:31 -0400

save_attachment is looking good, "Cool". My only gripe is that the URLs are getting very long; 80 column wrap will be an ongoing issue and
most likely unsolvable. I am not married to the path layout I used. I did have a problem with using only the fully qualified date:
after 3 years there would be over 1000 files in one directory.
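
To make the directory-count concern concrete, here is a minimal sketch of a month/day layout in the spirit of the strftime("%Y%m/%d") path used in the patch below. The names attachment_dir and entries_per_level are hypothetical and are not part of Scrubber.py; the counts are rough upper bounds, not measurements.

```python
import os
from datetime import date

def attachment_dir(base, when):
    """Return a per-day directory under base, e.g. base/200208/14."""
    return os.path.join(base, when.strftime("%Y%m"), when.strftime("%d"))

def entries_per_level(years):
    """Rough upper bounds on directory entries for each layout."""
    flat = 366 * years    # one directory per day, all under one parent
    months = 12 * years   # sharded: month directories under the parent
    days = 31             # sharded: day directories inside one month
    return flat, months, days

print(attachment_dir("attachments", date(2002, 8, 14)))
print(entries_per_level(3))
```

With one directory per day in a single parent, three years gives on the order of a thousand entries; splitting month from day keeps every level at a few dozen.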

I am not sure about white vs. black list. The white list is nice because I know what types will pass through, but it will have the
problem of playing catch-up with new types, a hassle factor for the admins, and questions from new users. The black list is nice, but
will I wake up one morning and read about the "latest hole" that is being exploited? That could ruin a whole day ;-) Pondering it, I
suspect a white list with a good set of defaults should work. I kind of like the "get the extension from the MIME type" approach, but it broke
down as soon as I tried to attach a "word" document: it came up as application/octet-stream with only the extension as a clue. I like
the method but I do not think it will last; we will end back up at lists (or maybe a real open source anti-virus :-)
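
A minimal sketch of the white list idea and of the octet-stream problem, using only the standard mimetypes module. ALLOWED_TYPES, is_allowed and pick_extension are hypothetical names, and the default set is only an illustration; a real list would come from the list configuration.

```python
import mimetypes

# Hypothetical default white list; not taken from Mailman.
ALLOWED_TYPES = {"image/jpeg", "image/png", "image/gif", "text/plain"}

def is_allowed(ctype, allowed=ALLOWED_TYPES):
    """Pass a part if its full type or its main type is on the white list."""
    ctype = ctype.lower()
    maintype = ctype.split("/", 1)[0]
    return ctype in allowed or maintype in allowed

def pick_extension(ctype, filename=None):
    """Guess an extension from the MIME type, falling back to the filename."""
    if ctype.lower() != "application/octet-stream":
        ext = mimetypes.guess_extension(ctype)
        if ext:
            return ext
    # octet-stream or unknown type: the filename is the only clue left
    if filename and "." in filename:
        return "." + filename.rsplit(".", 1)[1].lower()
    return None
```

The fallback branch is exactly the weak spot described above: once the type is application/octet-stream, the sender-supplied filename decides the extension, so an extension list is still needed on top of the type list.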

PS. I am sure I will get the pointy hat award for the patch below :-) I also have it running on the test server at
http://www.michaelmeltzer.com/mailman/listinfo/meltzer-list , it is open (at least for a few days :-), if anyone wants to pass some
traffic through it and see the output. Just do not flood it out.

Index: MimeDel.py
RCS file: /cvsroot/mailman/mailman/Mailman/Handlers/MimeDel.py,v
retrieving revision 2.1
diff -u -r2.1 MimeDel.py
--- MimeDel.py 18 Apr 2002 20:46:53 -0000 2.1
+++ MimeDel.py 14 Aug 2002 08:21:58 -0000
@@ -33,7 +33,9 @@
 from Mailman import Errors
 from Mailman.Logging.Syslog import syslog
 from Mailman.Version import VERSION
+from Mailman.Handlers.Scrubber import save_attachment
+from time import strftime
+from Mailman.i18n import _

 def process(mlist, msg, msgdata):
@@ -41,6 +43,7 @@
     if not mlist.filter_content or not mlist.filter_mime_types:
         return
     # We also don't care about our own digests or plaintext
+    make_attachment(mlist, msg)
     ctype = msg.get_type('text/plain')
     mtype = msg.get_main_type('text')
     if msgdata.get('isdigest') or ctype == 'text/plain':
@@ -54,7 +57,7 @@
     if msg.is_multipart():
         # Recursively filter out any subparts that match the filter list
         prelen = len(msg.get_payload())
-        filter_parts(msg, filtertypes)
+        filter_parts(mlist, msg, filtertypes)
         # If the outer message is now an emtpy multipart (and it wasn't
         # before!) then, again it gets discarded.
         postlen = len(msg.get_payload())
@@ -96,7 +99,7 @@

-def filter_parts(msg, filtertypes):
+def filter_parts(mlist, msg, filtertypes):
     # Look at all the message's subparts, and recursively filter
     if not msg.is_multipart():
         return 1
@@ -104,9 +107,12 @@
     prelen = len(payload)
     newpayload = []
     for subpart in payload:
-        keep = filter_parts(subpart, filtertypes)
+        keep = filter_parts(mlist, subpart, filtertypes)
         if not keep:
             continue
+        if make_attachment(mlist, subpart):
+            newpayload.append(subpart)
+            continue
         ctype = subpart.get_type('text/plain')
         mtype = subpart.get_main_type('text')
         if ctype in filtertypes or mtype in filtertypes:
@@ -164,3 +170,32 @@
         changedp = 1
     return changedp
+def make_attachment(mlist, subpart):
+    # should be set from mlist, work in progress
+    # BTW this will act real stupid with multipart, it needs the real object not the housekeeping
+    attach_filter = ['image/bmp', 'image/jpeg', 'image/tiff', 'image/gif', 'image/png', 'image/pjpeg', 'image/x-png',
+    ctype = subpart.get_type('text/plain')
+    mtype = subpart.get_main_type('text')
+    if ctype in attach_filter or mtype in attach_filter:
+        cctype = subpart.get_type()
+        # size is off, just could not stand to call decode to correct, might just take off 20% and be done
+        size = len(subpart.get_payload())
+        desc = subpart.get('content-description', (_('not available')))
+        filename = subpart.get_filename(_('not available'))
+        url = save_attachment(mlist, subpart, strftime("attch/%Y%m/%d"))
+        del subpart['content-type']
+        del subpart['content-transfer-encoding']
+        del subpart['content-disposition']
+        del subpart['content-description']
+        subpart.add_header('Content-Type', 'text/plain', charset='us-ascii')
+        subpart.add_header('Content-Transfer-Encoding', '7bit')
+        subpart.set_payload(_("""\
+Name: %(filename)s Type: %(cctype)s Size: %(size)d bytes Desc: %(desc)s
+Url: %(url)s
+"""))
+        return 1
+    else:
+        return 0
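
For readers without a Mailman 2.1 tree, the idea in make_attachment can be sketched with the Python 3 standard library alone. This is an illustration under assumptions, not the patch itself: stub_out is a hypothetical name, the URL is passed in because save_attachment is Mailman-internal, and the size correction assumes base64, which emits 4 characters per 3 bytes, so the decoded size is roughly three quarters of the encoded length.

```python
from email.message import Message

def stub_out(part, url):
    """Replace an attachment part in place with a short text/plain summary."""
    ctype = part.get_content_type()
    filename = part.get_filename("not available")
    desc = part.get("Content-Description", "not available")
    size = len(part.get_payload())
    if part.get("Content-Transfer-Encoding", "").lower() == "base64":
        # Approximate the decoded size instead of decoding the payload.
        size = size * 3 // 4
    for header in ("Content-Type", "Content-Transfer-Encoding",
                   "Content-Disposition", "Content-Description"):
        del part[header]  # deleting a missing header is a no-op
    part["Content-Type"] = 'text/plain; charset="us-ascii"'
    part["Content-Transfer-Encoding"] = "7bit"
    part.set_payload(
        "Name: %s Type: %s Size: %d bytes Desc: %s\nUrl: %s\n"
        % (filename, ctype, size, desc, url))
    return part
```

The estimate ignores line breaks inside the encoded payload, so it slightly overstates the size of long attachments; that is usually acceptable for a one-line summary.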

----- Original Message -----
From: "Barry A. Warsaw" <barry@python.org>
To: "Michael Meltzer" <mjm@michaelmeltzer.com>
Cc: <Mailman-Developers@python.org>
Sent: Tuesday, August 13, 2002 11:38 AM
Subject: Re: [Mailman-Developers] Scrubber.py confusion, 2.1b3

> >>>>> "MM" == Michael Meltzer <mjm@michaelmeltzer.com> writes:
>     MM> Actually I am "reusing" the code from Scrubber.py in MimeDel.py
>     MM> to turn attachments into links :-) I hardwired it for image
>     MM> types but it is generic enough. Some sample output from my
>     MM> "staging":
>     MM> Name: beach.jpg Type: image/jpeg Size: 18853 bytes Desc:
>     MM> not_available Url:
>     MM> http://www.michaelmeltzer.com/pipermail/meltzer-list/attachments/200208/12/beach.jpg-0005.jpe
> Cool.  I'm using a slightly different naming algorithm for the path.
>     MM> It turned out to be a 4 line hack to filter_parts, 1 line at
>     MM> the top and 10 lines to reformat the payload; the rest came
>     MM> from save_attachment, very handy :-)
> Can you try to update it to current cvs?  If it's really a 4 line
> hack, you've got to post it. :)  I tried to write the Scrubber.py
> updates with you in mind, by factoring out some other functionality
> you might need.
>     MM> I have to admit environment is nice to work in.
> :)
>     MM> I am not sure my code is up to patch quality :-) The next step
>     MM> would be a modification to the content filter page for the
>     MM> type it should react to.
>     MM> I would also suggest (Scrubber.py needs this too) that the
>     MM> filter pages list the extensions that it is allowed to write. Or
>     MM> the converse, the extensions it should not write;
>     MM> http://office.microsoft.com/Assistance/2000/Out2ksecFAQ.aspx would
>     MM> be my start :-), save the masses someday :-)
> I've been thinking about this.  I vaguely remember that someone did a
> patch to support pass-or-block semantics to the filter, but I can't
> put my finger on it now.  I want to link Dan Mick's name to that, but
> does this ring a bell with anyone?
>     MM> The issue with the directory is the number of files, not a
>     MM> name clash
> Yep, I know.
>     MM> , `ls -d archives/private/listname/attachments/* |
>     MM> wc -l` > 1000 I think system performance will be
>     MM> affected. Above 10,000 I know it would (it would also be a
>     MM> problem for the http server on access). I can understand
>     MM> keeping the attachments from each email in their own directory,
>     MM> but this way the "files version control" :-) groups them
>     MM> together for access (assuming least recency theory) and makes
>     MM> cleaning out for space/inodes simple. It was just strftime
>     MM> welded on.
> I'm not sure I followed all that, but the current Scrubber.py does add
> the date directory to the path, so I think we're good here.
> -Barry